
You can concatenate the values of a specific column across duplicate rows with the groupby shown below, but every column other than those named in the groupby disappears.

The DataFrame has the columns title, thumbnail, name and created_at.

I want to find the rows that share the same title and thumbnail, keep the first of those rows, and store the concatenated name values of the duplicated rows in a new column.

However, as mentioned earlier, columns other than those specified in groupby will disappear.

df.groupby(['title', 'thumbnail'])['name'].apply(lambda x: ' '.join(x)).reset_index()
    Can you please provide some examples to make the question clearer? Commented Nov 22, 2021 at 19:22

3 Answers


Suppose the following dataframe:

>>> df
    title thumbnail   name   created_at
0  title1    thumb1  name1        today
1  title1    thumb1  name2    yesterday
2  title1    thumb2  name3  another day

The output of your code is:

>>> df.groupby(['title', 'thumbnail'], as_index=False)['name'] \
      .apply(' '.join)
    title thumbnail         name
0  title1    thumb1  name1 name2
1  title1    thumb2        name3

If you don't want to lose columns or rows (that is, keep the original shape), use transform:

df['name'] = df.groupby(['title', 'thumbnail'])['name'] \
               .transform(' '.join)
print(df)

# Output:
    title thumbnail         name   created_at
0  title1    thumb1  name1 name2        today
1  title1    thumb1  name1 name2    yesterday
2  title1    thumb2        name3  another day

Otherwise, you have to decide which value to keep for each of the other columns. In this case, do you want to keep 'today' or 'yesterday' for created_at? Once you have decided, you can use agg:

>>> df.groupby(['title', 'thumbnail']) \
      .agg({'name': ' '.join, 'created_at': 'first'}) \
      .reset_index()

    title thumbnail         name   created_at
0  title1    thumb1  name1 name2        today
1  title1    thumb2        name3  another day

Setup:

import pandas as pd

data = {'title': ['title1', 'title1', 'title1'],
        'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
        'name': ['name1', 'name2', 'name3'],
        'created_at': ['today', 'yesterday', 'another day']}
df = pd.DataFrame(data)
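
As an alternative to agg with 'first', the transform result can be followed by drop_duplicates to keep only the first row of each group. The following is a minimal sketch on the same sample data; the variable names keys, joined and result are illustrative, and assign is used only so that df itself is not modified.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['title1', 'title1', 'title1'],
    'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
    'name': ['name1', 'name2', 'name3'],
    'created_at': ['today', 'yesterday', 'another day'],
})

keys = ['title', 'thumbnail']  # illustrative name for the grouping columns

# assign returns a new frame, so df itself is left unchanged
joined = df.assign(name=df.groupby(keys)['name'].transform(' '.join))

# keep only the first row of each (title, thumbnail) pair
result = joined.drop_duplicates(subset=keys, keep='first').reset_index(drop=True)
print(result)
```

Every column that is not part of the grouping keeps the value from the first row of its group, which should match what 'first' does in the agg version for this data.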

6 Comments

Use df.assign if you don't want the assignment to be in-place.
I avoided assign for readability. I only want to illustrate groupby_apply, groupby_transform and groupby_agg.
Replace lambda x: ' '.join(x) with ' '.join
@rudolfovic. Good point! Thanks.
It's because you chose to keep the first row, which is equivalent to 'first'. You can use transform and then drop_duplicates to get the same result (which is clearer than your one-liner).

That's because you're selecting the name column via ['name'], so by definition the only columns available are the ones that make up the index (the groupby keys, which are required) and the column you selected.

Instead of calling apply on the ["name"] column of the groupby, call apply directly on the groupby:

df.groupby(['title', 'thumbnail']).apply(lambda x: ' '.join(x['name'])).reset_index()
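
To see what this call returns, here is a small sketch on sample data that is assumed for illustration (the frame below is not from the question). Because the lambda returns one string per group, apply produces a Series, so reset_index is given name='name' here to label the joined column. Note that created_at is still absent from the result.

```python
import pandas as pd

# illustrative sample data, assumed for this sketch
df = pd.DataFrame({
    'title': ['title1', 'title1', 'title1'],
    'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
    'name': ['name1', 'name2', 'name3'],
    'created_at': ['today', 'yesterday', 'another day'],
})

# each group x is a sub-DataFrame; the lambda reduces it to a single string
out = (
    df.groupby(['title', 'thumbnail'])
      .apply(lambda x: ' '.join(x['name']))
      .reset_index(name='name')  # name the otherwise unnamed Series column
)
print(out)
```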

2 Comments

this doesn't solve the problem of losing additional columns - have a look at my solution
@Corralien please explain exactly what you are getting and how

Using a toy DataFrame for illustration:

import pandas as pd

df = pd.DataFrame({
    'title': ['tom', 'tom', 'tom', 'mark', 'mark', 'lewis'],
    'name': list('abcdef'),
    'marks': [55, 99, 14, 28, 19, 88]
})

In any case we will need to group:

# the toy frame has no thumbnail column, so group on title alone
groups = df.groupby('title')

Here is a neat solution using a join:

groups.first().join(groups['name'].agg(' '.join), rsuffix='s')
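
Spelled out step by step, the join approach might look like the sketch below. It assumes the toy frame is grouped on title only (it has no thumbnail column); the intermediate names firsts and names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['tom', 'tom', 'tom', 'mark', 'mark', 'lewis'],
    'name': list('abcdef'),
    'marks': [55, 99, 14, 28, 19, 88],
})

groups = df.groupby('title')  # assumption: group on title only

firsts = groups.first()               # first name and marks for each title
names = groups['name'].agg(' '.join)  # Series called 'name', indexed by title

# both sides have a 'name' column; rsuffix='s' renames the right one to 'names'
result = firsts.join(names, rsuffix='s')
print(result)
```

Both pieces are indexed by title, so the join aligns them without an explicit key.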

A more efficient solution would get the name aggregation and the rest of the columns in a single pass:

def process(group):
  # take the first row as a Series; copy it so it can be modified safely
  result = group.iloc[0].copy()
  # then add a concatenation of all names for this group
  result['names'] = ' '.join(group['name'])
  # return the first row, now carrying the combined names
  return result

This could also be done in a single line:

def process(group):
  return group.iloc[[0]].assign(names=' '.join(group['name']))

Then simply apply the helper function to all the groups:

groups.apply(process)

The two methods get the same results:

       title   name  marks   names
name                             
lewis  lewis      f     88       f
mark    mark      d     28     d e
tom      tom      a     55   a b c
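
For a self-contained run of the helper approach, here is a hedged sketch. It assumes grouping on title only and selects the non-key columns explicitly before apply, so the helper sees the same columns regardless of how a given pandas version treats grouping columns; the name first_with_names is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['tom', 'tom', 'tom', 'mark', 'mark', 'lewis'],
    'name': list('abcdef'),
    'marks': [55, 99, 14, 28, 19, 88],
})

def first_with_names(group):
    # keep the first row as a one-row frame and attach the joined names
    return group.iloc[[0]].assign(names=' '.join(group['name']))

result = (
    df.groupby('title')[['name', 'marks']]  # pass only the non-key columns
      .apply(first_with_names)
      .droplevel(1)                         # drop the original row labels
)
print(result)
```

In this variant title ends up as the index rather than a column; calling reset_index() on the result would turn it back into a regular column.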
