Given the following pandas DataFrame:
import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({'id': range(1, N + 1),
                   'code': [random.choice(codes) for _ in range(N)],
                   'colour': [random.choice(colours) for _ in range(N)],
                   'texture': [random.choice(textures) for _ in range(N)],
                   'size': [random.randint(1, 100) for _ in range(N)]
                   }, columns=['id', 'code', 'colour', 'texture', 'size'])
I run the line below to get the aggregated sizes grouped by code and colour pairs:
grouped = df.groupby(['code', 'colour']).agg({'size' : np.sum}).reset_index()
>> grouped
>> code colour size
>> 0 one black 987
>> 1 one white 972
>> 2 three black 972
>> 3 three white 488
>> 4 two black 1162
>> 5 two white 1158
>> [6 rows x 3 columns]
In addition to the aggregated (np.sum) sizes, I want separate columns for:
i. the average value (np.mean) per group,
ii. the id of the row with the max size in a given group,
iii. how many times the group occurred (e.g. code=one, colour=black: 12 times).
Question: What is the fastest way to do this? Would I have to use apply() with a custom function?
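One possible sketch, avoiding a per-group apply(): pandas named aggregation (available in pandas >= 0.25) computes sum, mean and count in a single groupby pass, and the 'idxmax' aggregator returns the dataframe index of the largest size per group, which can then be mapped back to the 'id' column. The column names (size_sum, size_mean, count, max_size_id) are illustrative choices, not anything the question prescribes.

```python
import random

import pandas as pd

# Same setup as in the question (seeded so the run is reproducible).
random.seed(42)
codes = ["one", "two", "three"]
colours = ["black", "white"]
N = 100
df = pd.DataFrame({
    'id': range(1, N + 1),
    'code': [random.choice(codes) for _ in range(N)],
    'colour': [random.choice(colours) for _ in range(N)],
    'size': [random.randint(1, 100) for _ in range(N)],
})

# Single groupby with named aggregation: sum, mean and count come straight
# from agg(); 'idxmax' yields the dataframe index of the row holding each
# group's largest size, which we translate back into that row's 'id'.
out = df.groupby(['code', 'colour']).agg(
    size_sum=('size', 'sum'),
    size_mean=('size', 'mean'),
    count=('size', 'count'),
    max_size_idx=('size', 'idxmax'),
).reset_index()
out['max_size_id'] = df.loc[out['max_size_idx'], 'id'].values
out = out.drop(columns='max_size_idx')
```

This keeps everything vectorised inside one groupby, which is generally faster than apply() with a Python-level function that is called once per group.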