Calculating monthly revenue given start and end date for each ID using Polars

Question

I have a dataframe in this format:

import polars as pl

df = pl.from_repr("""
┌─────┬────────────┬────────────┬──────────┐
│ ID  ┆ DATE_PREV  ┆ DATE       ┆ REV_DIFF │
│ --- ┆ ---        ┆ ---        ┆ ---      │
│ i64 ┆ date       ┆ date       ┆ i64      │
╞═════╪════════════╪════════════╪══════════╡
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     │
│ 2   ┆ 2025-06-01 ┆ 2025-06-01 ┆ 2500     │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    │
└─────┴────────────┴────────────┴──────────┘
""")

What I want to do is split every row into as many rows as there are months between DATE_PREV and DATE, and calculate the revenue for each month.

My expected output would be:

ID	MONTH	REVENUE
1	2025-07-01	454,54
1	2025-08-01	4545,45
2	2025-06-01	2500
3	2025-01-01	22666,66
3	2025-02-01	37333,33

The date range of ID 1 spans 11 days in total (both endpoints included), of which 1 day falls in July, so 1/11 of the revenue goes to July and 10/11 goes to August. The date range of ID 2 is 1 day, so 100% goes to June. The date range of ID 3 is 45 days, of which 17 fall in January and 28 in February. Date ranges can span multiple months and can start or end on any day of the month.

I can't figure out a way to do this, especially one that stays reasonably performant as the row count grows.

etrotta · Accepted Answer · 2025-10-08 15:53:09Z

The easiest way is to calculate the amount each day contributes, explode each day into a separate row, then group by month and aggregate.

You can use date_ranges to get every date within each row's interval, explode to separate them into individual rows, then group_by_dynamic to aggregate per month:

dates = pl.date_ranges('DATE_PREV', 'DATE').alias('dates')
value_per_day = pl.col('REV_DIFF') / dates.list.len()

daily = df.select('ID', dates, value_per_day).explode('dates')

result = daily.group_by_dynamic('dates', every='1mo', group_by='ID').agg(pl.col('REV_DIFF').sum())
print(result)
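
As a cross-check of the per-day allocation described above, here is a minimal plain-Python sketch (standard library only, not Polars). The helper name monthly_revenue is made up for illustration; it assumes both DATE_PREV and DATE count as days in the range, as in the question.

```python
from collections import defaultdict
from datetime import date, timedelta


def monthly_revenue(start: date, end: date, revenue: float) -> dict[date, float]:
    """Spread revenue evenly over every day in [start, end], then sum per month."""
    total_days = (end - start).days + 1  # inclusive of both endpoints
    per_day = revenue / total_days
    months: dict[date, float] = defaultdict(float)
    day = start
    while day <= end:
        # Key each day by the first day of its month.
        months[day.replace(day=1)] += per_day
        day += timedelta(days=1)
    return dict(months)


# Rows from the question: (ID, DATE_PREV, DATE, REV_DIFF)
rows = [
    (1, date(2025, 7, 31), date(2025, 8, 10), 5000),
    (2, date(2025, 6, 1), date(2025, 6, 1), 2500),
    (3, date(2025, 1, 15), date(2025, 2, 28), 60000),
]

for id_, start, end, rev in rows:
    for month, value in monthly_revenue(start, end, rev).items():
        print(id_, month, round(value, 2))
```

This loops over every day, so it is only meant for checking small inputs against the Polars result, not as a performant alternative.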

Henry Harbeck · Accepted Answer · 2025-10-09 03:33:48Z

Here is a solution that doesn't involve exploding each day into a row and grouping the data up again. It generates a date range, truncates each date to the month, then counts values.

This gives you a list of structs like this (e.g., for ID 1):

[{"MONTH": 2025-07-01, "count": 1}, {"MONTH": 2025-08-01, "count": 10}]

Then the list is exploded, and you can compute the revenue that belongs to each month:

date_ranges = pl.date_ranges("DATE_PREV", "DATE")

(
    df.with_columns(
        months=date_ranges.list.eval(pl.element().dt.truncate("1mo").alias("MONTH").value_counts())
    )
    .explode("months")
    .select(
        "ID",
        pl.col("months").struct["MONTH"],
        # E.g. for ID 1, month 2025-08-01 -> REVENUE = 5000 / 11 * 10
        REVENUE=pl.col("REV_DIFF") / date_ranges.list.len() * pl.col("months").struct["count"],
    )
)

shape: (5, 3)
┌─────┬────────────┬──────────────┐
│ ID  ┆ MONTH      ┆ REVENUE      │
│ --- ┆ ---        ┆ ---          │
│ i64 ┆ date       ┆ f64          │
╞═════╪════════════╪══════════════╡
│ 1   ┆ 2025-07-01 ┆ 454.545455   │
│ 1   ┆ 2025-08-01 ┆ 4545.454545  │
│ 2   ┆ 2025-06-01 ┆ 2500.0       │
│ 3   ┆ 2025-01-01 ┆ 22666.666667 │
│ 3   ┆ 2025-02-01 ┆ 37333.333333 │
└─────┴────────────┴──────────────┘

jqurious · Accepted Answer · 2025-10-09 11:02:41Z

Out of interest, I was trying to explode only the months:

(
    df.with_columns(
        ((pl.col.DATE - pl.col.DATE_PREV).dt.total_days() + 1).alias("TOTAL_DAYS"),
        pl.date_ranges(pl.col.DATE_PREV.dt.month_start(), pl.col.DATE.dt.month_end(), interval="1mo").alias("MONTH_START"),
    )
    .explode("MONTH_START")
)

shape: (5, 6)
┌─────┬────────────┬────────────┬──────────┬────────────┬─────────────┐
│ ID  ┆ DATE_PREV  ┆ DATE       ┆ REV_DIFF ┆ TOTAL_DAYS ┆ MONTH_START │
│ --- ┆ ---        ┆ ---        ┆ ---      ┆ ---        ┆ ---         │
│ i64 ┆ date       ┆ date       ┆ i64      ┆ i64        ┆ date        │
╞═════╪════════════╪════════════╪══════════╪════════════╪═════════════╡
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     ┆ 11         ┆ 2025-07-01  │
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     ┆ 11         ┆ 2025-08-01  │
│ 2   ┆ 2025-06-01 ┆ 2025-06-01 ┆ 2500     ┆ 1          ┆ 2025-06-01  │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    ┆ 45         ┆ 2025-01-01  │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    ┆ 45         ┆ 2025-02-01  │
└─────┴────────────┴────────────┴──────────┴────────────┴─────────────┘

From here, there seem to be 3 possible cases:

case
when DATE_PREV > MONTH_START            then  DATE_PREV_DAYS
when SAME_YEAR_MONTH(DATE, MONTH_START) then  DATE_DAYS
else                                          MONTH_DAYS
end

.when() can be used to construct similar logic.

(
    df
    .lazy()
    .with_columns(
        ((pl.col.DATE - pl.col.DATE_PREV).dt.total_days() + 1).alias("TOTAL_DAYS"),
        pl.date_ranges(pl.col.DATE_PREV.dt.month_start(), pl.col.DATE.dt.month_end(), interval="1mo").alias("MONTH_START"),
    )
    .explode("MONTH_START")
    .with_columns(
        pl.when(pl.col.DATE_PREV > pl.col.MONTH_START)
          .then(pl.col.DATE_PREV.dt.days_in_month() + 1 - pl.col.DATE_PREV.dt.day())
          .when(
              pl.col.DATE.dt.year() == pl.col.MONTH_START.dt.year(),  # .dt.to_string("%y%m") was a bit slower
              pl.col.DATE.dt.month() == pl.col.MONTH_START.dt.month(),
          )
          .then(pl.col.DATE.dt.day())
          .otherwise(pl.col.MONTH_START.dt.days_in_month())
          .alias("MONTH_DAYS")
    )
    .with_columns(
        (pl.col.REV_DIFF * (pl.col.MONTH_DAYS / pl.col.TOTAL_DAYS)).alias("REVENUE")
    )
    .collect(engine="streaming")
)

shape: (5, 8)
┌─────┬────────────┬────────────┬──────────┬────────────┬─────────────┬────────────┬──────────────┐
│ ID  ┆ DATE_PREV  ┆ DATE       ┆ REV_DIFF ┆ TOTAL_DAYS ┆ MONTH_START ┆ MONTH_DAYS ┆ REVENUE      │
│ --- ┆ ---        ┆ ---        ┆ ---      ┆ ---        ┆ ---         ┆ ---        ┆ ---          │
│ i64 ┆ date       ┆ date       ┆ i64      ┆ i64        ┆ date        ┆ i8         ┆ f64          │
╞═════╪════════════╪════════════╪══════════╪════════════╪═════════════╪════════════╪══════════════╡
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     ┆ 11         ┆ 2025-07-01  ┆ 1          ┆ 454.545455   │
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     ┆ 11         ┆ 2025-08-01  ┆ 10         ┆ 4545.454545  │
│ 2   ┆ 2025-06-01 ┆ 2025-06-01 ┆ 2500     ┆ 1          ┆ 2025-06-01  ┆ 1          ┆ 2500.0       │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    ┆ 45         ┆ 2025-01-01  ┆ 17         ┆ 22666.666667 │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    ┆ 45         ┆ 2025-02-01  ┆ 28         ┆ 37333.333333 │
└─────┴────────────┴────────────┴──────────┴────────────┴─────────────┴────────────┴──────────────┘

Using the Streaming Engine yielded the fastest results in my testing, so I've added lazy and collect calls.
