I have a DataFrame in this format:

import polars as pl

df = pl.from_repr("""
┌─────┬────────────┬────────────┬──────────┐
│ ID  ┆ DATE_PREV  ┆ DATE       ┆ REV_DIFF │
│ --- ┆ ---        ┆ ---        ┆ ---      │
│ i64 ┆ date       ┆ date       ┆ i64      │
╞═════╪════════════╪════════════╪══════════╡
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     │
│ 2   ┆ 2025-06-01 ┆ 2025-06-01 ┆ 2500     │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    │
└─────┴────────────┴────────────┴──────────┘
""")

What I want to do is split every row into as many rows as there are months between DATE_PREV and DATE, and calculate the revenue for each month.

My expected output would be:

ID  MONTH       REVENUE
1   2025-07-01  454.54
1   2025-08-01  4545.45
2   2025-06-01  2500
3   2025-01-01  22666.66
3   2025-02-01  37333.33

The date range of ID 1 covers 11 days in total (both endpoints included), of which 1 day is in July, so 1/11 of the revenue goes to July and 10/11 goes to August. The date range of ID 2 is a single day, so 100% goes to June. The date range of ID 3 covers 45 days: 17 days fall in January and 28 days in February. Date ranges can span multiple months and can start or end on any day of the month.

I can't figure out a way to do this, especially one that stays reasonably performant as the row count grows.

3 Answers

5

The easiest way is calculating the amount each day contributes, exploding each day into a separate row, then grouping by month and aggregating.

You can use date_ranges to get each date within the interval for each row, explode to separate them into individual rows, then group_by_dynamic to aggregate per month:

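# One list of dates per row, covering DATE_PREV through DATE (both inclusive)
dates = pl.date_ranges('DATE_PREV', 'DATE').alias('dates')
# Revenue contributed by each single day of the range
value_per_day = pl.col('REV_DIFF') / dates.list.len()

# One row per day, carrying that day's share of the revenue
daily = df.select('ID', dates, value_per_day).explode('dates')

# Sum the daily shares per ID and calendar month
result = daily.group_by_dynamic('dates', every='1mo', group_by='ID').agg(pl.col('REV_DIFF').sum())
print(result)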

2

Here is a solution that doesn't involve exploding each day into a row and grouping the data up again. It generates a date range per row, truncates each date to the start of its month, then counts how often each month occurs.

This gives you a list of structs like the following (e.g., for ID 1):

[{"MONTH": 2025-07-01, "count": 1}, {"MONTH": 2025-08-01, "count": 10}]

The list is then exploded, and you can compute the revenue that belongs to each month:

date_ranges = pl.date_ranges("DATE_PREV", "DATE")

(
    df.with_columns(
        months=date_ranges.list.eval(pl.element().dt.truncate("1mo").alias("MONTH").value_counts())
    )
    .explode("months")
    .select(
        "ID",
        pl.col("months").struct["MONTH"],
        # E.g. for ID 1, month 2025-08-01 -> REVENUE = 5000 / 11 * 10
        REVENUE=pl.col("REV_DIFF") / date_ranges.list.len() * pl.col("months").struct["count"],
    )
)
shape: (5, 3)
┌─────┬────────────┬──────────────┐
│ ID  ┆ MONTH      ┆ REVENUE      │
│ --- ┆ ---        ┆ ---          │
│ i64 ┆ date       ┆ f64          │
╞═════╪════════════╪══════════════╡
│ 1   ┆ 2025-07-01 ┆ 454.545455   │
│ 1   ┆ 2025-08-01 ┆ 4545.454545  │
│ 2   ┆ 2025-06-01 ┆ 2500.0       │
│ 3   ┆ 2025-01-01 ┆ 22666.666667 │
│ 3   ┆ 2025-02-01 ┆ 37333.333333 │
└─────┴────────────┴──────────────┘
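
As a cross-check of the counting step, the same idea can be sketched in plain Python without Polars. This is only an illustration of the arithmetic, not part of the Polars solution; the function names month_day_counts and split_revenue are made up for this sketch.

```python
from collections import Counter
from datetime import date, timedelta


def month_day_counts(start: date, end: date) -> Counter:
    """Count the days of the inclusive range [start, end] per month start."""
    counts = Counter()
    day = start
    while day <= end:
        counts[day.replace(day=1)] += 1  # truncate the date to its month start
        day += timedelta(days=1)
    return counts


def split_revenue(start: date, end: date, revenue: float) -> dict:
    """Allocate revenue to months in proportion to the days in each month."""
    counts = month_day_counts(start, end)
    total_days = sum(counts.values())
    return {month: revenue * n / total_days for month, n in counts.items()}


counts = month_day_counts(date(2025, 7, 31), date(2025, 8, 10))
print(counts[date(2025, 7, 1)], counts[date(2025, 8, 1)])  # prints: 1 10
```

The loop visits every day, so it is only meant for verifying a handful of rows, not for large inputs.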

1

Out of interest, I was trying to explode only the months:

(
    df.with_columns(
        ((pl.col.DATE - pl.col.DATE_PREV).dt.total_days() + 1).alias("TOTAL_DAYS"),
        pl.date_ranges(pl.col.DATE_PREV.dt.month_start(), pl.col.DATE.dt.month_end(), interval="1mo").alias("MONTH_START"),
    )
    .explode("MONTH_START")
)
shape: (5, 6)
┌─────┬────────────┬────────────┬──────────┬────────────┬─────────────┐
│ ID  ┆ DATE_PREV  ┆ DATE       ┆ REV_DIFF ┆ TOTAL_DAYS ┆ MONTH_START │
│ --- ┆ ---        ┆ ---        ┆ ---      ┆ ---        ┆ ---         │
│ i64 ┆ date       ┆ date       ┆ i64      ┆ i64        ┆ date        │
╞═════╪════════════╪════════════╪══════════╪════════════╪═════════════╡
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     ┆ 11         ┆ 2025-07-01  │
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     ┆ 11         ┆ 2025-08-01  │
│ 2   ┆ 2025-06-01 ┆ 2025-06-01 ┆ 2500     ┆ 1          ┆ 2025-06-01  │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    ┆ 45         ┆ 2025-01-01  │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    ┆ 45         ┆ 2025-02-01  │
└─────┴────────────┴────────────┴──────────┴────────────┴─────────────┘

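It seems from here, there are 4 possible cases (the first one covers ranges that start and end within the same month, e.g. 2025-06-10 to 2025-06-20, which would otherwise be counted up to the end of the month):

case
when SAME_YEAR_MONTH(DATE_PREV, DATE)   then  TOTAL_DAYS
when DATE_PREV > MONTH_START            then  DATE_PREV_DAYS
when SAME_YEAR_MONTH(DATE, MONTH_START) then  DATE_DAYS
else                                          MONTH_DAYS
end

.when() can be used to construct similar logic.

(
    df
    .lazy()
    .with_columns(
        # Inclusive number of days in the range
        ((pl.col.DATE - pl.col.DATE_PREV).dt.total_days() + 1).alias("TOTAL_DAYS"),
        # First day of every month the range touches
        pl.date_ranges(pl.col.DATE_PREV.dt.month_start(), pl.col.DATE.dt.month_end(), interval="1mo").alias("MONTH_START"),
    )
    .explode("MONTH_START")
    .with_columns(
        # Range starts and ends in the same month: count the days between the two dates
        pl.when(pl.col.DATE_PREV.dt.month_start() == pl.col.DATE.dt.month_start())
          .then(pl.col.DATE.dt.day() + 1 - pl.col.DATE_PREV.dt.day())
          # First month of a longer range: days from DATE_PREV to the end of that month
          .when(pl.col.DATE_PREV > pl.col.MONTH_START)
          .then(pl.col.DATE_PREV.dt.days_in_month() + 1 - pl.col.DATE_PREV.dt.day())
          # Last month of the range: days from the 1st up to DATE
          .when(
              pl.col.DATE.dt.year() == pl.col.MONTH_START.dt.year(),  # .dt.to_string("%y%m") was a bit slower
              pl.col.DATE.dt.month() == pl.col.MONTH_START.dt.month(),
          )
          .then(pl.col.DATE.dt.day())
          # Any month in between is fully covered
          .otherwise(pl.col.MONTH_START.dt.days_in_month())
          .alias("MONTH_DAYS")
    )
    .with_columns(
        (pl.col.REV_DIFF * (pl.col.MONTH_DAYS / pl.col.TOTAL_DAYS)).alias("REVENUE")
    )
    .collect(engine="streaming")
)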
shape: (5, 8)
┌─────┬────────────┬────────────┬──────────┬────────────┬─────────────┬────────────┬──────────────┐
│ ID  ┆ DATE_PREV  ┆ DATE       ┆ REV_DIFF ┆ TOTAL_DAYS ┆ MONTH_START ┆ MONTH_DAYS ┆ REVENUE      │
│ --- ┆ ---        ┆ ---        ┆ ---      ┆ ---        ┆ ---         ┆ ---        ┆ ---          │
│ i64 ┆ date       ┆ date       ┆ i64      ┆ i64        ┆ date        ┆ i8         ┆ f64          │
╞═════╪════════════╪════════════╪══════════╪════════════╪═════════════╪════════════╪══════════════╡
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     ┆ 11         ┆ 2025-07-01  ┆ 1          ┆ 454.545455   │
│ 1   ┆ 2025-07-31 ┆ 2025-08-10 ┆ 5000     ┆ 11         ┆ 2025-08-01  ┆ 10         ┆ 4545.454545  │
│ 2   ┆ 2025-06-01 ┆ 2025-06-01 ┆ 2500     ┆ 1          ┆ 2025-06-01  ┆ 1          ┆ 2500.0       │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    ┆ 45         ┆ 2025-01-01  ┆ 17         ┆ 22666.666667 │
│ 3   ┆ 2025-01-15 ┆ 2025-02-28 ┆ 60000    ┆ 45         ┆ 2025-02-01  ┆ 28         ┆ 37333.333333 │
└─────┴────────────┴────────────┴──────────┴────────────┴─────────────┴────────────┴──────────────┘

Using the Streaming Engine yielded the fastest results in my testing, so I've added lazy and collect calls.
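
For checking MONTH_DAYS on individual rows, the per-month day count can also be written as the overlap between the date range and the month: clamp the range to the month and count the remaining days. Below is a minimal plain-Python sketch of that idea; month_overlap_days is a hypothetical helper name, and this is not the Polars code above. In Polars, the same clamping could presumably be expressed with pl.max_horizontal and pl.min_horizontal, but I haven't benchmarked that.

```python
import calendar
from datetime import date


def month_overlap_days(start: date, end: date, month_start: date) -> int:
    """Days of the inclusive range [start, end] that fall in month_start's month."""
    last_day = calendar.monthrange(month_start.year, month_start.month)[1]
    month_end = month_start.replace(day=last_day)
    first = max(start, month_start)  # clamp the range start to the month
    last = min(end, month_end)       # clamp the range end to the month
    return max((last - first).days + 1, 0)  # 0 if the range misses the month


print(month_overlap_days(date(2025, 1, 15), date(2025, 2, 28), date(2025, 1, 1)))  # prints: 17
print(month_overlap_days(date(2025, 1, 15), date(2025, 2, 28), date(2025, 2, 1)))  # prints: 28
```

Because the clamping handles the start and the end of the range in the same way, no separate branches are needed for the first, last, or middle months.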
