python, pandas: How to specify multiple columns and merge only specific columns of duplicate rows

Question

I can concatenate the values of one column across rows that are duplicated on several columns with the code below, but every column other than the groupby keys and the selected column disappears from the result.

The DataFrame has the columns title, thumbnail, name, and created_at.

I want to find the rows that share the same title and thumbnail, keep the first row of each such group, and store the concatenated name values of the duplicated rows in a new column.

This is what I have so far. As mentioned above, the columns not used in the groupby (here, created_at) are lost:

df.groupby(['title', 'thumbnail'])['name'].apply(lambda x: ' '.join(x)).reset_index()

Can you please provide some examples to make the question clearer? – A.Najafi, Commented Nov 22, 2021 at 19:22

Corralien · Accepted Answer · 2021-11-22 20:05:12Z

2

Suppose the following dataframe:

>>> df
    title thumbnail   name   created_at
0  title1    thumb1  name1        today
1  title1    thumb1  name2    yesterday
2  title1    thumb2  name3  another day

The output of your code is:

>>> df.groupby(['title', 'thumbnail'], as_index=False)['name'] \
      .apply(' '.join)
    title thumbnail         name
0  title1    thumb1  name1 name2
1  title1    thumb2        name3

If you don't want to lose columns or rows (that is, you want to keep the original shape), use transform:

df['name'] = df.groupby(['title', 'thumbnail'])['name'] \
               .transform(' '.join)
print(df)

# Output:
    title thumbnail         name   created_at
0  title1    thumb1  name1 name2        today
1  title1    thumb1  name1 name2    yesterday
2  title1    thumb2        name3  another day

Otherwise, you have to decide how to aggregate each of the other columns you want to keep. In this case, do you want to keep 'today' or 'yesterday' for created_at? Once you have decided, you can use agg:

>>> df.groupby(['title', 'thumbnail']) \
      .agg({'name': ' '.join, 'created_at': 'first'}) \
      .reset_index()

    title thumbnail         name   created_at
0  title1    thumb1  name1 name2        today
1  title1    thumb2        name3  another day
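Since the question asks for the concatenated names in a new column, named aggregation may be a closer fit. The following is a minimal sketch using the sample data from this answer; the output column name names is an assumption chosen for illustration, and keeping the first name and created_at of each group is one possible choice, not the only one.

```python
import pandas as pd

# Same sample data as in the setup of this answer.
df = pd.DataFrame({
    'title': ['title1', 'title1', 'title1'],
    'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
    'name': ['name1', 'name2', 'name3'],
    'created_at': ['today', 'yesterday', 'another day'],
})

# Named aggregation: each keyword is an output column, given as
# (input column, aggregation). 'names' is a hypothetical column name.
out = df.groupby(['title', 'thumbnail'], as_index=False).agg(
    name=('name', 'first'),              # keep the first row's name
    names=('name', ' '.join),            # concatenate all names in the group
    created_at=('created_at', 'first'),  # keep the first row's created_at
)
print(out)
```

This keeps one row per (title, thumbnail) pair and leaves the original name value of the first row intact next to the new column.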

Setup:

import pandas as pd

data = {'title': ['title1', 'title1', 'title1'],
        'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
        'name': ['name1', 'name2', 'name3'],
        'created_at': ['today', 'yesterday', 'another day']}
df = pd.DataFrame(data)

edited Nov 22, 2021 at 20:05

answered Nov 22, 2021 at 19:40

Corralien


6 Comments

user17242583 Over a year ago

Use df.assign if you don't want the assignment to be in-place.
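A minimal sketch of that suggestion, assuming the sample data from the answer above and a hypothetical new column called names. assign returns a new DataFrame, so df and its name column are left untouched.

```python
import pandas as pd

# Sample data from the answer above.
df = pd.DataFrame({
    'title': ['title1', 'title1', 'title1'],
    'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
    'name': ['name1', 'name2', 'name3'],
    'created_at': ['today', 'yesterday', 'another day'],
})

keys = ['title', 'thumbnail']

# transform returns one value per original row, aligned on df's index,
# so the result can be attached as a new column without changing the shape.
# 'names' is a hypothetical column name used only for this sketch.
result = df.assign(names=df.groupby(keys)['name'].transform(' '.join))
print(result)
```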

Corralien Over a year ago

I didn't use assign, for readability. I only want to illustrate groupby_apply, groupby_transform and groupby_agg.

rudolfovic Over a year ago

Replace lambda x: ' '.join(x) with ' '.join

Corralien Over a year ago

@rudolfovic. Good point! Thanks.

Corralien Over a year ago

It's because you chose to keep the first row, which is equivalent to 'first'. You can use transform and then drop_duplicates to get the same result (which is clearer than your one-liner).
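The transform followed by drop_duplicates approach mentioned in this comment could look roughly like the sketch below. It assumes the sample data from the answer above and writes the concatenation to a hypothetical names column so that the first row's own name is preserved.

```python
import pandas as pd

# Sample data from the answer above.
df = pd.DataFrame({
    'title': ['title1', 'title1', 'title1'],
    'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
    'name': ['name1', 'name2', 'name3'],
    'created_at': ['today', 'yesterday', 'another day'],
})

keys = ['title', 'thumbnail']

# Step 1: attach the concatenated names to every row of its group.
# 'names' is a hypothetical column name used only for this sketch.
df['names'] = df.groupby(keys)['name'].transform(' '.join)

# Step 2: keep only the first row of each (title, thumbnail) group,
# which retains that row's values for all other columns.
result = df.drop_duplicates(subset=keys, keep='first').reset_index(drop=True)
print(result)
```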


user17242583 · Accepted Answer · 2021-11-22 19:37:19Z

1

That's because you're selecting only the name column via ['name'], so by definition the only columns available in the result are the group keys (which form the index) and the column you selected.

Instead of calling apply on the ["name"] column of the groupby, call apply directly on the groupby:

df.groupby(['title', 'thumbnail']).apply(lambda x: ' '.join(x['name'])).reset_index(name='name')

answered Nov 22, 2021 at 19:37

user17242583

2 Comments

rudolfovic Over a year ago

this doesn't solve the problem of losing additional columns - have a look at my solution

rudolfovic Over a year ago

@Corralien please explain exactly what you are getting and how

rudolfovic · Accepted Answer · 2021-11-22 20:53:37Z

Using a toy DataFrame for illustration:

import pandas as pd

df = pd.DataFrame({
    'title': ['tom', 'tom', 'tom', 'mark', 'mark', 'lewis'],
    'name': list('abcdef'),
    'marks': [55, 99, 14, 28, 19, 88]
})

In any case we will need to group. The toy DataFrame has no thumbnail column, so here we group by title only:

groups = df.groupby('title')

Here is a neat solution using a join:

groups.first().join(groups['name'].agg(' '.join), rsuffix='s')
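As a self-contained sketch of the join approach, the snippet below rebuilds the toy DataFrame and groups by title only, since the toy data has no thumbnail column. The intermediate variable names are chosen here for illustration.

```python
import pandas as pd

# Toy data from this answer.
df = pd.DataFrame({
    'title': ['tom', 'tom', 'tom', 'mark', 'mark', 'lewis'],
    'name': list('abcdef'),
    'marks': [55, 99, 14, 28, 19, 88],
})

groups = df.groupby('title')

# First row of every group, indexed by title.
first_rows = groups.first()

# Concatenated names per group, also indexed by title.
joined_names = groups['name'].agg(' '.join)

# Both sides have a column called 'name'; rsuffix='s' renames the
# right-hand one to 'names' so the two can coexist.
result = first_rows.join(joined_names, rsuffix='s')
print(result)
```

Because both pieces are indexed by the group key, the join aligns them without any extra merge keys.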

Another option is to compute the name aggregation and pick up the rest of the columns in a single pass over the groups:

def process(group):
  # take the first row (copy it so the original data is not modified)
  result = group.iloc[0].copy()
  # then add a concatenation of all names for this group
  result['names'] = ' '.join(group['name'])
  # return the result as a single row (a Series)
  return result

This could also be done in a single line:

def process(group):
  return group.iloc[[0]].assign(names=' '.join(group['name']))

Then simply apply the helper function to all the groups:

groups.apply(process)

The two methods give the same values; only the index details and whether title is repeated as a column differ:

       title name  marks  names
title
lewis  lewis    f     88      f
mark    mark    d     28    d e
tom      tom    a     55  a b c
