3

I want to increment a column value based on a certain condition within a polars dataframe, while considering how many times that condition was met.

Example data.

import polars as pl

df = pl.DataFrame({
    "before": [0, 0, 0, 0, 0, 0, 0, 0, 0],
    "cdl_type": ["REC", "REC", "GEC", None, None, "GEC", None, "REC", "GEC"],
})

Current approach.

df = df.with_columns(
    a=(
        pl.when(pl.col("cdl_type").is_in(["GEC", "REC"])).then(
            pl.int_ranges(
                pl.col("cdl_type")
                .is_in(["REC", "GEC"])
                .rle()
                .struct.field("len")
            ).flatten()
        )
        .when(pl.col('cdl_type').is_null().and_(pl.col('cdl_type').shift(1).is_not_null()))
        .then(pl.lit(1))
        .otherwise(0)
    )
)

Expected output.

┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘

4 Answers

2

Based on the current approach and the expected result, I take it that the condition is that cdl_type equals either "REC" or "GEC".

The expected output can then be obtained as follows.

  • For each contiguous block of rows satisfying the condition, we obtain a corresponding id using pl.Expr.rle_id on the condition expression (illustrated in the sketch just below).
  • We use that id to create an increasing integer sequence for each block using pl.int_range.
  • Finally, we add 1, shift the sequence, and fill any missing values with 0.
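To illustrate the first step, here is a quick sketch of the run ids that rle_id assigns on the sample data:

df.select(
    pl.col("cdl_type"),
    block=pl.col("cdl_type").is_in(["REC", "GEC"]).rle_id(),
)
# block should be 0, 0, 0, 1, 1, 2, 3, 4, 4 for the sample data:
# each contiguous run of matching rows (and each run of nulls) gets its own id

Putting it all together: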
df.with_columns(
    pl.when(
        pl.col("cdl_type").is_not_null()
    ).then(
        pl.int_range(
            pl.len()
        ).over(
            pl.col("cdl_type").is_in(["REC", "GEC"]).rle_id()
        )
    ).add(1).shift().fill_null(0)
)
shape: (9, 3)
┌────────┬──────────┬─────────┐
│ before ┆ cdl_type ┆ literal │
│ ---    ┆ ---      ┆ ---     │
│ i64    ┆ str      ┆ i64     │
╞════════╪══════════╪═════════╡
│ 0      ┆ REC      ┆ 0       │
│ 0      ┆ REC      ┆ 1       │
│ 0      ┆ GEC      ┆ 2       │
│ 0      ┆ null     ┆ 3       │
│ 0      ┆ null     ┆ 0       │
│ 0      ┆ GEC      ┆ 0       │
│ 0      ┆ null     ┆ 1       │
│ 0      ┆ REC      ┆ 0       │
│ 0      ┆ GEC      ┆ 1       │
└────────┴──────────┴─────────┘

1 Comment

This solves my real-world problem, thank you so much for your help.
1

(disclaimer: I had written this up earlier and came to post it, then realized how similar it is to @Hericks' answer).

Since you want your counts to reset every time a contiguous group of 'REC' or 'GEC' is broken, you'll need to break this problem into 2 parts. Rewriting the logic as pseudo-code, you end up with:

  • premise: When the current row has a value of either "REC" or "GEC"...
  • then: Take the incremental row count.
  • otherwise: The incremental count should become 0

Once we finish that logic, we should be able to shift everything down one row to create your desired output.

import polars as pl
from polars import col

df = pl.DataFrame({
    "before": [0, 0, 0, 0, 0, 0, 0, 0, 0],
    "cdl_type": ["REC", "REC", "GEC", None, None, "GEC", None, "REC", "GEC"],
})

print(
    df
    .with_columns(
        after=(
            pl.when(col('cdl_type').is_in(['GEC', 'REC']))
            .then(
                pl.int_range(pl.len()).add(1)
                .over(
                    col('cdl_type').is_in(['REC', 'GEC'])
                    .rle_id()
                )
            )
            .otherwise(0)
            .shift(1, fill_value=0)
        )
    )
)
# shape: (9, 3)
# ┌────────┬──────────┬───────┐
# │ before ┆ cdl_type ┆ after │
# │ ---    ┆ ---      ┆ ---   │
# │ i64    ┆ str      ┆ u32   │
# ╞════════╪══════════╪═══════╡
# │ 0      ┆ REC      ┆ 0     │
# │ 0      ┆ REC      ┆ 1     │
# │ 0      ┆ GEC      ┆ 2     │
# │ 0      ┆ null     ┆ 3     │
# │ 0      ┆ null     ┆ 0     │
# │ 0      ┆ GEC      ┆ 0     │
# │ 0      ┆ null     ┆ 1     │
# │ 0      ┆ REC      ┆ 0     │
# │ 0      ┆ GEC      ┆ 1     │
# └────────┴──────────┴───────┘
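For reference, dropping the final .shift(1, fill_value=0) shows the 1-based per-run counts before they are shifted down one row (a quick sketch reusing the frame above):

print(
    df
    .with_columns(
        pre_shift=(
            pl.when(col('cdl_type').is_in(['GEC', 'REC']))
            .then(
                pl.int_range(pl.len()).add(1)
                .over(col('cdl_type').is_in(['REC', 'GEC']).rle_id())
            )
            .otherwise(0)
        )
    )
)
# pre_shift should be 1, 2, 3, 0, 0, 1, 0, 1, 2 for the sample data;
# shifting it down one row and filling the first row with 0 gives the "after" column above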

Comments

1

We will add an extra null row into the sample frame (to help verify the result).

df = pl.DataFrame({
    "before": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "cdl_type": ["REC", "REC", "GEC", None, None, None, "GEC", None, "REC", "GEC"],
})

(My initial answer produced the expected output for the sample provided, but was broken for None runs of length > 2)

It looks like you want to also include the first following null as part of the run-length - which could be done by forward filling 1 step.

df.with_columns(
    pl.when(pl.col("cdl_type").is_not_null())
      .then(pl.col("cdl_type").is_not_null().rle_id())
      .forward_fill(limit=1)
      .rle_id()
      .alias("rle_id")
)
shape: (10, 3)
┌────────┬──────────┬────────┐
│ before ┆ cdl_type ┆ rle_id │
│ ---    ┆ ---      ┆ ---    │
│ i64    ┆ str      ┆ u32    │
╞════════╪══════════╪════════╡
│ 0      ┆ REC      ┆ 0      │
│ 0      ┆ REC      ┆ 0      │
│ 0      ┆ GEC      ┆ 0      │
│ 0      ┆ null     ┆ 0      │ # <-
│ 0      ┆ null     ┆ 1      │
│ 0      ┆ null     ┆ 1      │
│ 0      ┆ GEC      ┆ 2      │
│ 0      ┆ null     ┆ 2      │ # <-
│ 0      ┆ REC      ┆ 3      │
│ 0      ┆ GEC      ┆ 3      │
└────────┴──────────┴────────┘

rle() gives a struct containing each run's {len, value} pair.

df.select(
    pl.when(pl.col("cdl_type").is_not_null())
      .then(pl.col("cdl_type").is_not_null().rle_id())
      .forward_fill(limit=1)
      .rle()
)
shape: (4, 1)
┌───────────┐
│ cdl_type  │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {4,0}     │
│ {2,null}  │
│ {2,2}     │
│ {2,4}     │
└───────────┘

The len values are given to int_ranges() and flattened to create the count column.

df.with_columns(
    pl.int_ranges(
        pl.when(pl.col("cdl_type").is_not_null())
          .then(pl.col("cdl_type").is_not_null().rle_id())
          .forward_fill(limit=1)
          .rle()
          .struct.field("len")
    )
    .flatten()
    .alias("after")
)
shape: (10, 3)
┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ null     ┆ 1     │ # <- NOT OK
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘

We then set rows whose previous cdl_type value is null (i.e. the extra nulls in longer null runs) back to 0.

df.with_columns(
    pl.int_ranges(
        pl.when(pl.col("cdl_type").is_not_null())
          .then(pl.col("cdl_type").is_not_null().rle_id())
          .forward_fill(limit=1)
          .rle()
          .struct.field("len")
    )
    .flatten()
    .alias("after")
).with_columns(
    pl.when(pl.col("cdl_type").shift().is_not_null())
      .then(pl.col("after"))
      .otherwise(0)
)
shape: (10, 3)
┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ null     ┆ 0     │ # <- OK
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘

The reason for this approach is that it avoids using .over(), which can produce much faster results on larger dataframes.
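As a rough sanity check (a sketch only; the frame size and random data here are made up, and timings will vary with hardware and Polars version), both expressions can be compared on a larger frame:

import random
import time

import polars as pl

random.seed(0)
big = pl.DataFrame({
    "cdl_type": random.choices(["REC", "GEC", None], k=5_000_000),
})

# expression using .over() (as in the other answers)
over_expr = (
    pl.when(pl.col("cdl_type").is_not_null())
      .then(
          pl.int_range(pl.len())
            .over(pl.col("cdl_type").is_in(["REC", "GEC"]).rle_id())
      )
      .add(1)
      .shift()
      .fill_null(0)
)

# expression using rle()/int_ranges() (this answer; the final pass that zeroes
# longer null runs is omitted here for brevity)
rle_expr = (
    pl.int_ranges(
        pl.when(pl.col("cdl_type").is_not_null())
          .then(pl.col("cdl_type").is_not_null().rle_id())
          .forward_fill(limit=1)
          .rle()
          .struct.field("len")
    )
    .flatten()
)

for name, expr in [("over", over_expr), ("rle", rle_expr)]:
    start = time.perf_counter()
    big.with_columns(expr.alias("after"))
    print(name, round(time.perf_counter() - start, 3))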

Comments

0
df.with_columns(
    after = pl.int_range(pl.len()).over(
        pl.col.cdl_type
        .is_in(["REC", "GEC"])
        # fill each null with a unique value >= 2 so every null row forms its own
        # run and never merges with a True/False run (booleans cast to 1/0 here)
        .fill_null(pl.int_range(pl.len()) + 2)
        .rle_id()
    )
    # 0-based position within each run; add 1 only on non-null rows,
    # then shift down one row and fill the first row with 0
    .add(~pl.col.cdl_type.is_null()).shift().fill_null(0)
)
shape: (9, 3)
┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘

4 Comments

This fails if cdl_type has 2 leading nulls, no? As the 1 from .fill_null(pl.int_range(pl.len())) would contribute to the first true condition block.
True, thanks for that. Looks like a simple +1 can fix that though (or pl.int_range(1, pl.len() + 1)).
I think a +2 would be needed. Otherwise, a single leading null leads to the same effect. +2 should guarantee that you'll never have a 1 in the int range that could contribute to a true condition block.
[facepalm] of course, because int_range starts from 0.
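A quick sketch with a made-up frame containing two leading nulls suggests the +2 version handles that edge case:

edge = pl.DataFrame({
    "cdl_type": [None, None, "REC", "GEC", None, "REC"],
})

edge.with_columns(
    after=pl.int_range(pl.len()).over(
        pl.col.cdl_type
        .is_in(["REC", "GEC"])
        .fill_null(pl.int_range(pl.len()) + 2)
        .rle_id()
    )
    .add(~pl.col.cdl_type.is_null()).shift().fill_null(0)
)
# expected "after": 0, 0, 0, 1, 2, 0
# the two leading nulls stay in separate runs and never merge with the first REC/GEC run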
