4

Here's some sample data

import polars as pl

df = pl.DataFrame(
    {
        "date": [
            "2024-08-01",
            "2024-08-02",
            "2024-08-03",
            "2024-08-04",
            "2024-08-04",
            "2024-08-05",
            "2024-08-06",
            "2024-08-08",
            "2024-08-09",
        ],
        "type": ["A", "A", "A", "A", "B", "B", "B", "A", "A"],
    }
).with_columns(pl.col("date").str.to_date())

And my desired output would look something like this

shape: (9, 3)
┌────────────┬──────┬───────────────┐
│ date       ┆ type ┆ days_in_a_row │
│ ---        ┆ ---  ┆ ---           │
│ date       ┆ str  ┆ i64           │
╞════════════╪══════╪═══════════════╡
│ 2024-08-01 ┆ A    ┆ 1             │
│ 2024-08-02 ┆ A    ┆ 2             │
│ 2024-08-03 ┆ A    ┆ 3             │
│ 2024-08-04 ┆ A    ┆ 4             │
│ 2024-08-04 ┆ B    ┆ 1             │
│ 2024-08-05 ┆ B    ┆ 2             │
│ 2024-08-06 ┆ B    ┆ 3             │
│ 2024-08-08 ┆ A    ┆ 1             │
│ 2024-08-09 ┆ A    ┆ 2             │
└────────────┴──────┴───────────────┘

Where my days_in_a_row counter gets reset upon a date gap greater than 1 day.

What I've tried so far

df.with_columns(days_in_a_row=pl.cum_count("date").over("type"))

Which gives me

shape: (9, 3)
┌────────────┬──────┬───────────────┐
│ date       ┆ type ┆ days_in_a_row │
│ ---        ┆ ---  ┆ ---           │
│ date       ┆ str  ┆ u32           │
╞════════════╪══════╪═══════════════╡
│ 2024-08-01 ┆ A    ┆ 1             │
│ 2024-08-02 ┆ A    ┆ 2             │
│ 2024-08-03 ┆ A    ┆ 3             │
│ 2024-08-04 ┆ A    ┆ 4             │
│ 2024-08-04 ┆ B    ┆ 1             │
│ 2024-08-05 ┆ B    ┆ 2             │
│ 2024-08-06 ┆ B    ┆ 3             │
│ 2024-08-08 ┆ A    ┆ 5             │
│ 2024-08-09 ┆ A    ┆ 6             │
└────────────┴──────┴───────────────┘

Which is not resetting after the gap. I can't quite nail this one down.

I've also tried variations with

df
    .with_columns(date_gap=pl.col("date").diff().over("type"))
    .with_columns(days_in_a_row=(pl.cum_count("date").over("date_gap", "type")))

Which get's closer, but it still ends up not resetting where I'd want it

shape: (9, 4)
┌────────────┬──────┬──────────────┬───────────────┐
│ date       ┆ type ┆ date_gap     ┆ days_in_a_row │
│ ---        ┆ ---  ┆ ---          ┆ ---           │
│ date       ┆ str  ┆ duration[ms] ┆ u32           │
╞════════════╪══════╪══════════════╪═══════════════╡
│ 2024-08-01 ┆ A    ┆ null         ┆ 1             │
│ 2024-08-02 ┆ A    ┆ 1d           ┆ 1             │
│ 2024-08-03 ┆ A    ┆ 1d           ┆ 2             │
│ 2024-08-04 ┆ A    ┆ 1d           ┆ 3             │
│ 2024-08-04 ┆ B    ┆ null         ┆ 1             │
│ 2024-08-05 ┆ B    ┆ 1d           ┆ 1             │
│ 2024-08-06 ┆ B    ┆ 1d           ┆ 2             │
│ 2024-08-08 ┆ A    ┆ 4d           ┆ 1             │
│ 2024-08-09 ┆ A    ┆ 1d           ┆ 4             │
└────────────┴──────┴──────────────┴───────────────┘

1 Answer 1

3

What about first creating a secondary grouper based on the duration ≥ 1day?

from datetime import timedelta

(df.with_columns(group=pl.col("date").diff().gt(timedelta(days=1))
                         .fill_null(True).cum_sum().over("type"))
   .with_columns(days_in_a_row=pl.cum_count("date").over(["type", "group"]))
)

Output:

┌────────────┬──────┬───────┬───────────────┐
│ date       ┆ type ┆ group ┆ days_in_a_row │
│ ---        ┆ ---  ┆ ---   ┆ ---           │
│ date       ┆ str  ┆ u32   ┆ u32           │
╞════════════╪══════╪═══════╪═══════════════╡
│ 2024-08-01 ┆ A    ┆ 1     ┆ 1             │
│ 2024-08-02 ┆ A    ┆ 1     ┆ 2             │
│ 2024-08-03 ┆ A    ┆ 1     ┆ 3             │
│ 2024-08-04 ┆ A    ┆ 1     ┆ 4             │
│ 2024-08-04 ┆ B    ┆ 1     ┆ 1             │
│ 2024-08-05 ┆ B    ┆ 1     ┆ 2             │
│ 2024-08-06 ┆ B    ┆ 1     ┆ 3             │
│ 2024-08-08 ┆ A    ┆ 2     ┆ 1             │
│ 2024-08-09 ┆ A    ┆ 2     ┆ 2             │
└────────────┴──────┴───────┴───────────────┘
Sign up to request clarification or add additional context in comments.

3 Comments

Good solution! Minor: You can compare against pl.duration(days=1) and avoid the datetime.timedelta import.
Thanks @Hericks! I actually tried pl.duration(days=1) and got an error (not able to check it now), which is why I used datetime.timedelta. I'll double check it later.
Interesting, considering that pl.duration(days=1) worked fine for me ¯\\_(ツ)_/¯

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.