Here is a quick demo showing that the second approach is about 10x slower than the first one.
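import pandas as pd
from timeit import default_timer as timer

# Build a frame of about 10 million rows: a 3-value group key,
# an integer column, and a wide string column.
r = range(1, int(1e7))
df = pd.DataFrame({
    'col0': [i % 3 for i in r],
    'col1': r
})
df['pad'] = '*' * 100

# First approach: select col1, then aggregate.
start = timer()
print(df.groupby('col0')['col1'].min())
end = timer()
print(end - start)

# Second approach: aggregate every column, then select col1.
start = timer()
print(df.groupby('col0').min()['col1'])
end = timer()
print(end - start)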
Output:
col0
0 3
1 1
2 2
Name: col1, dtype: int64
0.14302301406860352
col0
0 3
1 1
2 2
Name: col1, dtype: int64
1.4934422969818115
The reason is obvious: in the second case min is also calculated for the pad column, while in the first case it is not.
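A minimal sketch to check this on a small scale, assuming a hypothetical 6-row frame named small (not the frame timed above): the intermediate result of the full aggregation carries a pad column, while selecting col1 first never produces one.

```python
import pandas as pd

# Hypothetical 6-row frame with the same three columns as the demo.
small = pd.DataFrame({
    'col0': [1, 2, 0, 1, 2, 0],
    'col1': [1, 2, 3, 4, 5, 6],
})
small['pad'] = '*' * 100

# Aggregating the whole groupby produces a min for every non-key column.
full = small.groupby('col0').min()
print(list(full.columns))

# Selecting col1 first yields a Series; pad is never aggregated.
selected = small.groupby('col0')['col1'].min()
print(selected.name)
```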
Is there any way to make python aware that computation on DataFrameGroupBy is required for col1 only in the second case?
If this is impossible, then I'm curious whether this is a limitation of the current pandas implementation or of the Python language itself (i.e. the expression df.groupby('col0').min() must be fully computed no matter what follows).
Thanks
col1 then use first version - and this is "way to make python aware that computation on DataFrameGroupBy is required for col1 only" - and it doesn't need any changes in pandas
pandas does not work like that. groupby operations are first fully evaluated before you can select a particular column. So, your options are basically: df.groupby('col0')['col1'].min(), or (in this example) df.groupby('col0').min(numeric_only=True), or df.groupby('col0').agg({col: 'min' for col in ['col1']}).
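The three options from the comment can be compared side by side. This is only a sketch on a hypothetical 9-row frame with the same column layout as the question; the variable names are made up for illustration, and it checks that the results agree rather than measuring speed.

```python
import pandas as pd

# Hypothetical small frame mirroring the question's layout.
df = pd.DataFrame({
    'col0': [i % 3 for i in range(1, 10)],
    'col1': range(1, 10),
})
df['pad'] = '*' * 100

# Option 1: select the column before aggregating.
by_selection = df.groupby('col0')['col1'].min()

# Option 2: restrict the aggregation to numeric columns, then select.
by_numeric_only = df.groupby('col0').min(numeric_only=True)['col1']

# Option 3: name the columns to aggregate explicitly.
by_agg = df.groupby('col0').agg({col: 'min' for col in ['col1']})['col1']

# The slow form from the question, for comparison.
by_full = df.groupby('col0').min()['col1']

print(by_selection.equals(by_numeric_only))
print(by_selection.equals(by_agg))
print(by_selection.equals(by_full))
```

Note that option 2 only helps here because pad is a string column; with extra numeric columns it would still aggregate all of them.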