
For the following pandas DataFrame:

import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe

# example DataFrame with random categorical and numeric columns
df = pd.DataFrame({'id': range(1, N + 1),
                   'code': [random.choice(codes) for i in range(N)],
                   'colour': [random.choice(colours) for i in range(N)],
                   'texture': [random.choice(textures) for i in range(N)],
                   'size': [random.randint(1, 100) for i in range(N)]},
                  columns=['id', 'code', 'colour', 'texture', 'size'])

I run the line below to get the aggregated sizes grouped by code and colour pairs:

grouped = df.groupby(['code', 'colour']).agg({'size' : np.sum}).reset_index()
>> grouped
>>     code colour  size
>> 0    one  black   987
>> 1    one  white   972
>> 2  three  black   972
>> 3  three  white   488
>> 4    two  black  1162
>> 5    two  white  1158
>> [6 rows x 3 columns]

In addition to the aggregated (np.sum) sizes, I want to get separate columns for:

i. the average value (np.average) per group

ii. the id of the row with the max size for a given group,

iii. how many times the group occurred (e.g. code=one, colour=black, 12 times)

Question: What is the fastest way to do this? Would I have to use apply() with a custom function?

1 Answer

You can pass a list of functions to be applied to the group, e.g.:

grouped = df.groupby(['code', 'colour'])['size'].agg([np.sum, np.average, np.size, np.argmax]).reset_index()

Since argmax is the index of the maximum row in each group, you will need to look those up on the original dataframe:

grouped['max_row_id'] = df.ix[grouped['argmax']].reset_index(grouped.index).id

NOTE: I selected the 'size' column because all of the functions apply to that column. If you want to apply a different set of functions to different columns, you can call agg with a dictionary mapping each column to a list of functions, e.g. agg({'size': [np.sum, np.average]}). This results in MultiIndex columns, which means that when getting the IDs for the maximum size in each group you need to do:

grouped['max_row_id'] = df.ix[grouped['size']['argmax']].reset_index(grouped.index).id
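
As a rough sketch of how those MultiIndex columns behave (assuming the df defined in the question; the printed column tuples are illustrative rather than actual output), each result column is addressed by a (column, function) tuple, and the two levels can be joined if flat names are easier to work with:

# With a dict-of-lists spec, the result has two column levels:
# level 0 is the original column name ('size'), level 1 is the function name.
grouped = df.groupby(['code', 'colour']).agg({'size': [np.sum, np.average]}).reset_index()

print(grouped.columns.tolist())
# roughly: [('code', ''), ('colour', ''), ('size', 'sum'), ('size', 'average')]

# A single result column can be selected with the full tuple...
totals = grouped[('size', 'sum')]

# ...or the two levels can be joined into flat names such as 'size_sum':
grouped.columns = ['_'.join(filter(None, col)) for col in grouped.columns]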

4 Comments

Thank you for your clear answer. I noticed that the code works for just the size column version but does not work when I do: grouped = df.groupby(['code', 'colour']).agg({'size':[np.sum, np.average, np.size, np.argmax]}).reset_index(); grouped['max_row_id'] = df.ix[grouped['argmax']].reset_index(grouped.index).id. Am I missing something obvious?
Using a dictionary as the argument for agg results in MultiIndex columns, so it needs to be grouped['max_row_id'] = df.ix[grouped['size']['argmax']].reset_index(grouped.index).id. I'll update my answer to make this clearer.
I was looking at this in more detail, and in the solution above np.argmax does not return the index of the corresponding row in df. For, say, the one/black pair, it returns the row number within df.ix[(df['code'] == 'one') & (df['colour'] == 'black')] of the row with the highest size value. So the solution for getting max_row_id is not quite right.
I solved it, in your answer np.argmax needs to be replaced by pd.Series.idxmax. I edited and re-accepted your answer.
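
Putting the comments together, a minimal sketch of the corrected single-column version might look like this (it assumes the df from the question, uses df.loc in place of the since-deprecated df.ix, and relies on pd.Series.idxmax returning the index label of the max-size row in each group):

# Aggregate 'size' per (code, colour) group: total, mean, group count,
# and the index label of the row holding the maximum size.
grouped = (df.groupby(['code', 'colour'])['size']
             .agg([np.sum, np.average, np.size, pd.Series.idxmax])
             .reset_index())

# idxmax holds original df index labels, so .loc retrieves the matching rows;
# their 'id' values become the max_row_id column.
grouped['max_row_id'] = df.loc[grouped['idxmax'], 'id'].values

On recent pandas versions the same aggregation can also be written with string names, e.g. .agg(['sum', 'mean', 'size', 'idxmax']).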
