
For the following pandas DataFrame:

import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe

# example DataFrame with random categorical and numeric columns
df = pd.DataFrame({'id': range(1, N + 1),
                   'code': [random.choice(codes) for i in range(N)],
                   'colour': [random.choice(colours) for i in range(N)],
                   'texture': [random.choice(textures) for i in range(N)],
                   'size': [random.randint(1, 100) for i in range(N)]},
                  columns=['id', 'code', 'colour', 'texture', 'size'])

I run the line below to get the aggregated sizes grouped by code and colour pairs:

grouped = df.groupby(['code', 'colour']).agg({'size' : np.sum}).reset_index()
>> grouped
>>     code colour  size
>> 0    one  black   987
>> 1    one  white   972
>> 2  three  black   972
>> 3  three  white   488
>> 4    two  black  1162
>> 5    two  white  1158
>> [6 rows x 3 columns]

In addition to the aggregated (np.sum) sizes, I want to get separate columns for:

i. the average value (np.average) per group

ii. the id of the row with the max size for a given group,

iii. how many times the group occurred (e.g. code=one, colour=black, 12 times)

Question: What is the fastest way to do this? Would I have to use apply() with a custom function?

1 Answer

You can pass a list of functions to be applied to the group, e.g.:

grouped = df.groupby(['code', 'colour'])['size'].agg([np.sum, np.average, np.size, np.argmax]).reset_index()

Since argmax is the index of the maximum row in each group, you will need to look those up on the original dataframe:

grouped['max_row_id'] = df.ix[grouped['argmax']].reset_index(grouped.index).id

NOTE: I selected the 'size' column because all of the functions apply to that column. If you want to apply a different set of functions to different columns, you can call agg with a dictionary mapping each column to a list of functions, e.g. agg({'size': [np.sum, np.average]}). This results in MultiIndex columns, which means that when getting the IDs for the maximum size in each group you need to do:

grouped['max_row_id'] = df.ix[grouped['size']['argmax']].reset_index(grouped.index).id
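
As a rough sketch of how those MultiIndex columns behave (assuming the df defined in the question; the printed column tuples are illustrative rather than actual output), each result column is addressed by a (column, function) tuple, and the two levels can be joined if flat names are easier to work with:

# With a dict-of-lists spec, the result has two column levels:
# level 0 is the original column name ('size'), level 1 is the function name.
grouped = df.groupby(['code', 'colour']).agg({'size': [np.sum, np.average]}).reset_index()

print(grouped.columns.tolist())
# roughly: [('code', ''), ('colour', ''), ('size', 'sum'), ('size', 'average')]

# A single result column can be selected with the full tuple...
totals = grouped[('size', 'sum')]

# ...or the two levels can be joined into flat names such as 'size_sum':
grouped.columns = ['_'.join(filter(None, col)) for col in grouped.columns]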

4 Comments

Thank you for your clear answer. I noticed that the code works for just the size column version but does not work when I do: grouped = df.groupby(['code', 'colour']).agg({'size':[np.sum, np.average, np.size, np.argmax]}).reset_index(); grouped['max_row_id'] = df.ix[grouped['argmax']].reset_index(grouped.index).id. Am I missing something obvious?
Using a dictionary as the argument for agg results in MultiIndex columns, so it needs to be grouped['max_row_id'] = df.ix[grouped['size']['argmax']].reset_index(grouped.index).id. I'll update my answer to make this clearer.
I was looking at this in more detail, and in the solution above np.argmax does not return the index of the corresponding row in df. For, say, the one/black pair, it returns the row number within df.ix[(df['code'] == 'one') & (df['colour'] == 'black')] of the row with the highest size value. So the solution for getting max_row_id is not quite right.
I solved it, in your answer np.argmax needs to be replaced by pd.Series.idxmax. I edited and re-accepted your answer.
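
Putting the comments together, a minimal sketch of the corrected single-column version might look like this (it assumes the df from the question, uses df.loc in place of the since-deprecated df.ix, and relies on pd.Series.idxmax returning the index label of the max-size row in each group):

# Aggregate 'size' per (code, colour) group: total, mean, group count,
# and the index label of the row holding the maximum size.
grouped = (df.groupby(['code', 'colour'])['size']
             .agg([np.sum, np.average, np.size, pd.Series.idxmax])
             .reset_index())

# idxmax holds original df index labels, so .loc retrieves the matching rows;
# their 'id' values become the max_row_id column.
grouped['max_row_id'] = df.loc[grouped['idxmax'], 'id'].values

On recent pandas versions the same aggregation can also be written with string names, e.g. .agg(['sum', 'mean', 'size', 'idxmax']).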
