Get the row(s) which have the max value in groups using groupby

Question

How do I find all rows in a pandas DataFrame which have the max value for the count column, after grouping by the ['Sp', 'Mt'] columns?

Example 1: the following DataFrame:

   Sp   Mt Value   count
0  MM1  S1   a     **3**
1  MM1  S1   n       2
2  MM1  S3   cb    **5**
3  MM2  S3   mk    **8**
4  MM2  S4   bg    **10**
5  MM2  S4   dgd     1
6  MM4  S2   rd      2
7  MM4  S2   cb      2
8  MM4  S2   uyi   **7**

Expected output is to get the result rows whose count is max in each group, like this:

   Sp   Mt   Value  count
0  MM1  S1   a      **3**
2  MM1  S3   cb     **5**
3  MM2  S3   mk     **8**
4  MM2  S4   bg     **10** 
8  MM4  S2   uyi    **7**

Example 2:

   Sp   Mt   Value  count
4  MM2  S4   bg     10
5  MM2  S4   dgd    1
6  MM4  S2   rd     2
7  MM4  S2   cb     8
8  MM4  S2   uyi    8

Expected output:

   Sp   Mt   Value  count
4  MM2  S4   bg     10
7  MM4  S2   cb     8
8  MM4  S2   uyi    8

This answer is the fastest solution I could find: stackoverflow.com/a/21007047/778533
– tommy.carstensen, Commented Mar 26, 2017 at 2:17

wjandrea · Accepted Answer · 2023-02-18 21:20:59Z

609

Firstly, we can get the max count for each group like this:

In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Sp', 'Mt'])['count'].max()
Out[2]:
Sp   Mt
MM1  S1     3
     S3     5
MM2  S3     8
     S4    10
MM4  S2     7
Name: count, dtype: int64

To select the matching rows of the original DataFrame, build a boolean mask that marks where count equals the group max:

In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max) == df['count']

In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

Note that if you have multiple max values per group, all will be returned.
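
As a minimal, self-contained sketch of the mask approach, here it is applied to Example 2 from the question, where the ('MM4', 'S2') group has a tie. It passes the string 'max' to transform rather than the builtin, and the variable names (group_max, is_max, result) are illustrative only.

```python
import pandas as pd

# Example 2 from the question: the ('MM4', 'S2') group has two rows tied at 8.
df = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# Broadcast each group's max back onto every row, then compare row by row.
group_max = df.groupby(["Sp", "Mt"])["count"].transform("max")
is_max = df["count"] == group_max

result = df[is_max]
print(result)  # rows 4, 7 and 8: both tied rows are kept
```

Because the mask is aligned to the original index, the selected rows keep their original labels.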

Update

On a Hail Mary chance that this is what the OP is requesting:

In [5]: df['count_max'] = df.groupby(['Sp', 'Mt'])['count'].transform(max)

In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          5
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7

edited Feb 18, 2023 at 21:20

wjandrea

answered Mar 29, 2013 at 15:09

Zelazny7

5 Comments

3pitt Over a year ago

@Zelazny7 I'm using the second, idx approach. But I can only afford a single maximum for each group (and my data has a few duplicate maxes). Is there a way to get around this with your solution?

Woods Chen Over a year ago

The transform method may have poor performance when the data set is large enough; getting the max value first and then merging the DataFrames will be better.

Prakash Vanapalli Over a year ago

As @3pitt mentioned, this is wrong for the original question asked.

Zelazny7 Over a year ago

@PrakashVanapalli no it isn't

citynorman Over a year ago

In newer versions you need to use transform('max')

Rani · Accepted Answer · 2016-11-16 10:14:22Z

329

You can sort the DataFrame by count and then remove duplicates. I think it's easier:

df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
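
A small sketch of what this returns when a group has tied maxima, using Example 2 from the question. The result is assigned to a new variable (top, an illustrative name) because neither call modifies df.

```python
import pandas as pd

# Example 2 from the question: the ('MM4', 'S2') group has two rows tied at 8.
df = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# Sort so the largest count comes first, then keep the first row per (Sp, Mt).
top = df.sort_values("count", ascending=False).drop_duplicates(["Sp", "Mt"])

print(top)
print(len(top))  # one row per group, so only one of the two tied rows survives
```

Which of the tied rows survives depends on the sort order among equal values, so this fits the case where any single max row per group is acceptable.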

answered Nov 16, 2016 at 10:14

Rani

9 Comments

Nolan Conaway Over a year ago

Very nice! Fast with largish frames (25k rows)

Tyler Over a year ago

For those who are somewhat new with Python, you will need to assign this to a new variable, it doesn't change the current df variable.

TMrtSmith Over a year ago

@Samir or use inplace = True as an argument to drop_duplicates

Woods Chen Over a year ago

This is a great answer when you need only one of the rows with the same max value; however, it won't work as expected if you need all the rows with max values.

Woods Chen Over a year ago

I mean, if the DataFrame is pd.DataFrame({'sp':[1, 1, 2], 'mt':[1, 1, 2], 'value':[2, 2, 3]}), then there will be 2 rows with the same max value 2 in the group where sp==1 and mt==1. @Rani

wjandrea · Accepted Answer · 2023-02-18 21:35:54Z

127

An easy solution is to apply idxmax() to get the index label of the row with the max value in each group, then use those labels to select the rows. Note that idxmax() returns only the first occurrence, so a group with tied maxima yields a single row.

In [367]: df
Out[367]: 
    sp  mt  val  count
0  MM1  S1    a      3
1  MM1  S1    n      2
2  MM1  S3   cb      5
3  MM2  S3   mk      8
4  MM2  S4   bg     10
5  MM2  S4  dgb      1
6  MM4  S2   rd      2
7  MM4  S2   cb      2
8  MM4  S2  uyi      7


# Apply idxmax() and use .loc[] on the DataFrame to select the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]: 
    sp  mt  val  count
0  MM1  S1    a      3
2  MM1  S3   cb      5
3  MM2  S3   mk      8
4  MM2  S4   bg     10
8  MM4  S2  uyi      7


# Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
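
To illustrate how idxmax() treats ties, here is a short sketch on Example 2 from the question (variable names are illustrative). Only the first of the two rows tied at 8 is selected.

```python
import pandas as pd

# Example 2 from the question: rows 7 and 8 are tied for the max in ('MM4', 'S2').
df = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# idxmax() gives one label per group: the first row where the max occurs.
first_max_labels = df.groupby(["Sp", "Mt"])["count"].idxmax()
print(first_max_labels.tolist())  # [4, 7]

one_per_group = df.loc[first_max_labels]
print(one_per_group)  # row 8 is absent even though it also has count 8
```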

edited Feb 18, 2023 at 21:35

wjandrea

answered Jul 7, 2017 at 0:40

Surya Chhetri

2 Comments

Max Power Over a year ago

The questioner here specified "I want to get ALL the rows where count equals max in each group", while idxmax "Return[s] index of first occurrence of maximum over requested axis" according to the docs (0.21).

Carlos Souza Over a year ago

This is a great solution, but for a different problem

blackraven · Accepted Answer · 2022-08-04 20:53:40Z

82

You may not need groupby() at all; you can use sort_values + drop_duplicates instead:

df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10

Almost the same logic, using tail:

df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10
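
If the count column can contain NaN, the default sort places NaN last, so keep='last' would pick the NaN row. A hedged sketch of the na_position="first" variant on a small made-up frame (the data and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative data: the ('MM1', 'S1') group has one real count and one NaN.
df = pd.DataFrame(
    {
        "Sp": ["MM1", "MM1", "MM2"],
        "Mt": ["S1", "S1", "S3"],
        "count": [3, np.nan, 8],
    }
)

# Default na_position="last": NaN sorts to the end and keep="last" selects it.
default_pick = df.sort_values("count").drop_duplicates(["Sp", "Mt"], keep="last")

# na_position="first": NaN sorts to the front, so the real max is the last row.
nan_safe_pick = df.sort_values("count", na_position="first").drop_duplicates(
    ["Sp", "Mt"], keep="last"
)

print(default_pick)
print(nan_safe_pick)
```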

edited Aug 4, 2022 at 20:53

blackraven

answered Jan 4, 2019 at 14:55

BENY

5 Comments

Clay Over a year ago

Not only is this an order of magnitude faster than the other solutions (at least for my use case), it has the added benefit of simply chaining as part of the construction of the original dataframe.

Hunaphu Over a year ago

When you see this answer, you realize that all the others are wrong. This is clearly the way to do it. Thanks.

Antoine Over a year ago

One should add na_position="first" to sort_values in order to ignore NaNs.

John Stud Over a year ago

I found this to be fast for my DF of several million rows.

Benjamin Ziepert Over a year ago

This doesn't appear to work with ties.

landewednack · Accepted Answer · 2014-02-11 18:06:24Z

42

Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.

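import pandas as pd

df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
    })

# Max count per (sp, mt) group, indexed by the group keys
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})

# Turn the group keys back into ordinary columns so they can be merged on
df_grouped = df_grouped.reset_index()

# Rename so the group max doesn't clash with the original 'count' column
df_grouped = df_grouped.rename(columns={'count':'count_max'})

# Attach each group's max to every row of that group
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])

# Keep only the rows whose count equals their group's max
df = df[df['count'] == df['count_max']]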
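
A more compact variant of the same idea, offered as a sketch rather than as this answer's method: compute the per-group max with as_index=False, then inner-merge on the keys and on count together, so only rows matching their group max survive. The names keys, group_max and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sp": ["MM1", "MM1", "MM1", "MM2", "MM2", "MM2", "MM4", "MM4", "MM4"],
        "mt": ["S1", "S1", "S3", "S3", "S4", "S4", "S2", "S2", "S2"],
        "val": ["a", "n", "cb", "mk", "bg", "dgb", "rd", "cb", "uyi"],
        "count": [3, 2, 5, 8, 10, 1, 2, 2, 7],
    }
)
keys = ["sp", "mt"]

# One row per group with that group's max count; as_index=False keeps
# 'sp' and 'mt' as ordinary columns so they can serve as merge keys.
group_max = df.groupby(keys, as_index=False)["count"].max()

# Inner merge on the keys and on 'count': only rows equal to the group max match.
result = df.merge(group_max, on=keys + ["count"])
print(result)
```

Note that merge returns a fresh 0..n-1 index, so the original row labels are not preserved with this variant.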

edited Feb 11, 2014 at 18:06

answered Feb 11, 2014 at 17:54

landewednack

7 Comments

goh Over a year ago

indeed this is much faster. transform seems to be slow for large dataset.

tommy.carstensen Over a year ago

Can you add comments to explain what each line does?

Roland Over a year ago

fwiw: I found the more elegant-looking solution from @Zelazny7 took a long time to execute for my set of ~100K rows, but this one ran pretty quickly. (I'm running a now way-obsolete 0.13.0, which might account for slowness).

Qy Zuo Over a year ago

But doing df[df['count'] == df['count_max']] will lose NaN rows, as will the answers above.

Gerard Over a year ago

I highly suggest using this approach; for bigger data frames it is much faster to use .apply() or .agg().

wjandrea · Accepted Answer · 2023-02-18 21:40:08Z

19

Use groupby and idxmax methods:

Convert the date column to datetime:

df['date'] = pd.to_datetime(df['date'])

Get the index of the max of column date, after grouping by ad_id:
```
idx = df.groupby(by='ad_id')['date'].idxmax()
```
get the wanted data:
```
df_max = df.loc[idx,]
```

   ad_id  price       date
7     22      2 2018-06-11
6     23      2 2018-06-22
2     24      2 2018-06-30
3     28      5 2018-06-22

edited Feb 18, 2023 at 21:40

wjandrea

answered Jul 24, 2018 at 10:45

blueear

1 Comment

wjandrea Over a year ago

date column??? This seems like the answer to a different question. Otherwise, it's a duplicate of Surya's answer and it has the same problem: in case of a tie, only the first occurrence is kept.

PAC · Accepted Answer · 2015-07-02 12:52:33Z

15

For me, the easiest solution is to keep the rows where count equals the group maximum. The following one-line command is enough:

df[df['count'] == df.groupby(['Sp', 'Mt'])['count'].transform(max)]

answered Jul 2, 2015 at 12:52

PAC

1 Comment

wjandrea Over a year ago

This is the same solution as Zelazny7's answer. Please don't post duplicate answers.

Mauro Mascia · Accepted Answer · 2021-03-02 11:42:36Z

10

Summarizing, there are many ways, but which one is faster?

import pandas as pd
import numpy as np
import time

df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))

start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))

And the winner is...

---1 ) 0.03337574005126953 seconds ---
---2 ) 0.1346898078918457 seconds ---
---3 ) 0.10243558883666992 seconds ---
---3b) 0.1004343032836914 seconds ---
---4 ) 0.028397560119628906 seconds ---
---5 ) 0.07552886009216309 seconds ---
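
One caveat when reading these timings: the candidates do not all return the same rows. The seeded sketch below (a smaller, illustrative frame, not the benchmark above) compares the row counts of the transform approach, which keeps every tied row, with sort plus drop_duplicates, which keeps one row per group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for repeatability
df = pd.DataFrame(rng.integers(1, 10, size=(10_000, 2)), columns=list("AB"))

# Methods 1 and 4 in spirit: keep every row tied for its group's max.
all_ties = df[df["B"] == df.groupby("A")["B"].transform("max")]

# Method 3 in spirit: keep exactly one row per group.
one_per_group = df.sort_values("B").drop_duplicates(["A"], keep="last")

print(len(all_ties), len(one_per_group))
```

With so few distinct values, ties are common, so the two results differ greatly in size; a fair comparison depends on whether ties need to be kept.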

answered Mar 2, 2021 at 11:42

Mauro Mascia

1 Comment

Jon Over a year ago

Great job including the timer which is missing from all of these suggestions. There are a few more and importantly, it would also be good to add it on a larger dataset. Using 2.8 million rows with varying amount of duplicates shows some startling figures. Especially using the nlargest fails spectacularly (like more than 100 fold slower) on large data. The fastest for my data was the sort by then drop duplicate (drop all but last marginally faster than sort descending and drop all but first)

wjandrea · Accepted Answer · 2023-02-20 17:12:33Z

8

Try using nlargest on the groupby object. The advantage is that it returns the rows where "the nlargest item(s)" were fetched from, and we can get their index.

In this case, we want n=1 for the max and keep='all' to include duplicate maxes.

Note: we take the last (-1) element of each index entry, since the index in this case consists of tuples (e.g. ('MM1', 'S1', 0)).

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})

d = df.groupby(['Sp', 'Mt'])['count'].nlargest(1, keep='all')

df.loc[[i[-1] for i in d.index]]

    Sp  Mt  Val  count
0  MM1  S1    a      3
2  MM1  S3   cb      5
3  MM2  S3   mk      8
4  MM2  S4   bg     10
8  MM4  S2  uyi      7
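
A sketch of the same approach on Example 2 from the question, where keep='all' matters because of the tie. It reads the original row labels with get_level_values(-1) instead of the list comprehension; the variable names are illustrative.

```python
import pandas as pd

# Example 2 from the question: rows 7 and 8 are tied for the max in ('MM4', 'S2').
df = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# keep="all" retains every row tied for the largest value in each group.
d = df.groupby(["Sp", "Mt"])["count"].nlargest(1, keep="all")

# The last index level holds the original row labels.
result = df.loc[d.index.get_level_values(-1)]
print(result)
```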

edited Feb 20, 2023 at 17:12

wjandrea

answered Jul 8, 2019 at 16:37

Kweweli

3 Comments

wjandrea Over a year ago

If the input has a MultiIndex, it might be better to do something more like df.loc[d.droplevel(['Sp', 'Mt']).index]. I'm not sure.

wjandrea Over a year ago

You could do this more idiomatically with df.loc[d.index.get_level_values(-1)].

Prakash Vanapalli Over a year ago

this is correct but very very slow on large dataset with ~100k rows.

wjandrea · Accepted Answer · 2023-02-19 02:05:05Z

7

I've been using this functional style for many group operations:

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'Count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})

(df.groupby(['Sp', 'Mt'])
   .apply(lambda group: group[group['Count'] == group['Count'].max()])
   .reset_index(drop=True))

    Sp  Mt  Val  Count
0  MM1  S1    a      3
1  MM1  S3   cb      5
2  MM2  S3   mk      8
3  MM2  S4   bg     10
4  MM4  S2  uyi      7

.reset_index(drop=True) drops the group index that apply adds and renumbers the rows from 0, so the original row labels are not kept.

edited Feb 19, 2023 at 2:05

wjandrea

answered Jan 14, 2019 at 10:03

joh-mue

1 Comment

wjandrea Over a year ago

Instead of reset_index, you could consider .droplevel([0]), with .groupby(..., as_index=False)

Surya Chhetri · Accepted Answer · 2019-04-10 02:38:11Z

6

Applying nlargest to the groupby object works just as well:

An additional advantage is that it can also fetch the top n values if required:

In [85]: import pandas as pd

In [86]: df = pd.DataFrame({
    ...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
    ...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    ...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    ...: 'count' : [3,2,5,8,10,1,2,2,7]
    ...: })

## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
   count  mt   sp  val
0      3  S1  MM1    a
1      5  S3  MM1   cb
2      8  S3  MM2   mk
3     10  S4  MM2   bg
4      7  S2  MM4  uyi

answered Apr 10, 2019 at 2:38

Surya Chhetri

1 Comment

Benjamin Ziepert Over a year ago

This doesn't appear to work with ties.

nbertagnolli · Accepted Answer · 2021-04-20 21:51:53Z

4

If you sort your DataFrame, that ordering will be preserved within each group by groupby. You can then just grab the first or last element and reset the index.

df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
})

df.sort_values("count", ascending=False).groupby(["sp", "mt"]).first().reset_index()
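
One thing to be aware of: groupby(...).first() takes the first non-null entry of each column, so with missing values it can combine fields from different rows. The sketch below uses a tiny made-up frame to contrast it with head(1), which returns whole rows; head(1) is a swapped-in alternative, not part of this answer.

```python
import numpy as np
import pandas as pd

# Illustrative data: the row with the max count has a missing 'val'.
df = pd.DataFrame(
    {
        "sp": ["MM1", "MM1"],
        "mt": ["S1", "S1"],
        "val": [np.nan, "n"],
        "count": [3, 2],
    }
)
ordered = df.sort_values("count", ascending=False)

# first() skips nulls column by column: 'val' comes from the second row.
via_first = ordered.groupby(["sp", "mt"]).first().reset_index()

# head(1) returns the first row of each group unchanged, original index included.
via_head = ordered.groupby(["sp", "mt"]).head(1)

print(via_first)
print(via_head)
```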

answered Apr 20, 2021 at 21:51

nbertagnolli

1 Comment

wjandrea Over a year ago

This is practically the same as Rani's answer and BENY's answer, just using a slightly different method.

Jon · Accepted Answer · 2022-08-02 13:04:20Z

4

Many of these are great answers, but to help show scalability, running them on 2.8 million rows with a varying number of duplicates shows some startling differences. The fastest for my data was sort then drop duplicates (sorting ascending and keeping the last was marginally faster than sorting descending and keeping the first).

Sort Ascending, Drop duplicate keep last (2.22 s)
Sort Descending, Drop Duplicate keep First (2.32 s)
Transform Max within the loc function (3.73 s)
Transform Max storing IDX then using loc select as second step (3.84 s)
Groupby using Tail (8.98 s)
idxmax with groupby and then using loc select as second step (95.39 s)
idxmax with groupby within the loc select (95.74 s)
NLargest(1) then using iloc select as a second step (> 35000 s ) - did not finish after running overnight
NLargest(1) within iloc select (> 35000 s ) - did not finish after running overnight

As you can see, sort is about 1/3 faster than transform and 75% faster than groupby with tail. idxmax is roughly 40x slower, and nlargest did not finish. On small datasets this may not matter much, but it can significantly impact large ones.

answered Aug 2, 2022 at 13:04

Jon

1 Comment

user16836078 Over a year ago

Good guides to performance for those who are using one of these methods!

George Liu · Accepted Answer · 2018-08-08 18:25:07Z

2

df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))

answered Aug 8, 2018 at 18:25

George Liu

Comments

citynorman · Accepted Answer · 2024-04-21 15:34:21Z

0

Another approach using rank

idx = df.groupby(['Sp', 'Mt'])['count'].rank(method="dense", ascending=False)==1
df[idx]
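
A short sketch of the rank approach on Example 2 from the question. With method="dense" and ascending=False, every row tied for the group max gets rank 1, so ties are kept. The variable names are illustrative.

```python
import pandas as pd

# Example 2 from the question: rows 7 and 8 are tied for the max in ('MM4', 'S2').
df = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# Rank within each group, largest first; tied values share the same rank.
rank = df.groupby(["Sp", "Mt"])["count"].rank(method="dense", ascending=False)
print(rank.tolist())  # [1.0, 2.0, 2.0, 1.0, 1.0]

result = df[rank == 1]
print(result)  # rows 4, 7 and 8
```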

answered Apr 21, 2024 at 15:34

citynorman

Comments

wjandrea · Accepted Answer · 2023-09-09 03:20:19Z

-1

df.loc[df.groupby('mt')['count'].idxmax()]

If the df index isn't unique, you may need to run df.reset_index(inplace=True) first.
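
A sketch of why a non-unique index matters, on a tiny made-up frame: idxmax() returns labels, and .loc with a repeated label selects every row carrying it. The sketch uses reset_index(drop=True), which discards the old index instead of keeping it as a column; that choice and the variable names are assumptions for illustration.

```python
import pandas as pd

# Illustrative data: label 0 is shared by two rows.
df = pd.DataFrame(
    {"mt": ["S1", "S3", "S3"], "count": [1, 5, 3]},
    index=[0, 0, 1],
)

# Both groups report label 0, so .loc returns more rows than there are groups.
labels = df.groupby("mt")["count"].idxmax()
ambiguous = df.loc[labels]
print(len(ambiguous))

# With a unique index, each label identifies exactly one row.
clean = df.reset_index(drop=True)
result = clean.loc[clean.groupby("mt")["count"].idxmax()]
print(result)
```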

edited Sep 9, 2023 at 3:20

wjandrea

answered Jul 7, 2022 at 2:52

upuil

1 Comment

wjandrea Over a year ago

This is a duplicate of Surya's answer except for the point about a non-unique index.
