
Let me provide a quick demo showing that the second approach is roughly 10× slower than the first one.

import pandas as pd
from timeit import default_timer as timer

r = range(1,int(1e7))

df = pd.DataFrame({
    'col0': [i % 3 for i in r],
    'col1': r
})

df['pad'] = '*' * 100

start = timer()
print(df.groupby('col0')['col1'].min())   # aggregate only col1
end = timer()
print(end - start)

start = timer()
print(df.groupby('col0').min()['col1'])   # aggregate every column, then select col1
end = timer()
print(end - start)

Output:

col0
0    3
1    1
2    2
Name: col1, dtype: int64
0.14302301406860352
col0
0    3
1    1
2    2
Name: col1, dtype: int64
1.4934422969818115

The reason is clear: in the second case pandas also calculates the min for the pad column, while in the first case it does not.

Is there any way to make pandas aware that the computation on the DataFrameGroupBy is required for col1 only in the second case?

If this is impossible, then I'm curious whether this is a limitation of the current pandas implementation or of the Python language itself (i.e. the expression df.groupby('col0').min() must be fully computed no matter what follows next).
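One way to see why the first form avoids the extra work (a small sketch, not part of the original question): selecting the column before aggregating produces a SeriesGroupBy wrapping only col1, so any later aggregation never sees the other columns.

```python
import pandas as pd
from pandas.core.groupby.generic import DataFrameGroupBy, SeriesGroupBy

df = pd.DataFrame({'col0': [0, 1, 2, 0], 'col1': [3, 1, 2, 6], 'pad': ['*' * 100] * 4})

# df.groupby(...) alone wraps every column, including 'pad'
g_all = df.groupby('col0')
assert isinstance(g_all, DataFrameGroupBy)

# selecting 'col1' first narrows the object to a single column,
# so .min() only ever touches that column
g_col1 = df.groupby('col0')['col1']
assert isinstance(g_col1, SeriesGroupBy)
```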

Thanks

Comments:

  • can you explain the usecase a bit more? i'm unclear on why you wouldn't use the first method in all situations given that it's faster Commented Mar 4 at 0:18
  • if you need to compute only for col1 then use first version - and this is "way to make python aware that computation on DataFrameGroupBy is required for col1 only" - and it doesn't need any changes in pandas Commented Mar 4 at 9:53
  • @DerekO, Let me provide some analogy. When using SQL, it does not matter how many columns are defined in a view. When I select specific columns from the view then only they are getting calculated (well, it depends on SQL engine and some other details but this is the rule). You can read more about SQL projection. Commented Mar 4 at 15:13
  • Back to python now. For example, if I want encapsulate some complex logic into a function which returns data frame object I want "the engine" to be able to calculate columns of that data frame when needed (or requested) instead of calculating all of them no matter what. Commented Mar 4 at 15:14
  • @SlimboyFat: right, so the short answer is that pandas does not work like that. groupby operations are first fully evaluated before you can select a particular column. So, your options are basically: df.groupby('col0')['col1'].min(), or (in this example) df.groupby('col0').min(numeric_only=True), or df.groupby('col0').agg({col: 'min' for col in ['col1']}). Commented Mar 4 at 15:30
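The options listed in the last comment can be checked to agree on a toy frame (a quick sketch; column names follow the question, and all three forms restrict the aggregation to col1):

```python
import pandas as pd

df = pd.DataFrame({
    'col0': [0, 1, 2, 0],
    'col1': [3, 1, 2, 6],
    'pad':  ['*' * 100] * 4,
})

a = df.groupby('col0')['col1'].min()                    # select the column first
b = df.groupby('col0').min(numeric_only=True)['col1']   # skip the non-numeric 'pad'
c = df.groupby('col0').agg({'col1': 'min'})['col1']     # aggregate named columns only

# all three avoid aggregating 'pad' and return the same result
assert a.equals(b) and a.equals(c)
print(a)
```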

1 Answer


pandas DataFrames use an eager execution model by design:

https://pandas.pydata.org/pandas-docs/version/0.18.1/release.html#id96

Eager evaluation of groups when calling groupby functions, so if there is an exception with the grouping function it will be raised immediately versus sometime later on when the groups are needed
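This eager behaviour is easy to observe (a small sketch to illustrate the quoted release note): grouping by a key function that raises fails at the groupby() call itself, before any aggregation is requested.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def bad_key(label):
    # a deliberately broken grouping function
    raise ValueError('broken grouping function')

try:
    df.groupby(bad_key)          # the key function runs here, eagerly
    raised_at_groupby = False
except ValueError:
    raised_at_groupby = True

# no .min()/.sum() call was needed to trigger the error
assert raised_at_groupby
```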

The alternative is pandas on Spark - https://spark.apache.org/pandas-on-spark/

pandas uses eager evaluation: it loads all the data into memory and executes operations immediately when they are invoked. It does not apply query optimization, and all the data must fit in memory before a query runs.

It is possible to convert between the two (to_spark/to_pandas), and similarly between pandas and traditional Spark DataFrames (createDataFrame/toPandas).
