3

I want to increment a column value based on a certain condition within a polars dataframe, while considering how many times that condition was met.

Example data.

import polars as pl

df = pl.DataFrame({
    "before": [0, 0, 0, 0, 0, 0, 0, 0, 0],
    "cdl_type": ["REC", "REC", "GEC", None, None, "GEC", None, "REC", "GEC"],
})

Current approach.

df = df.with_columns(
    a=(
        pl.when(pl.col("cdl_type").is_in(["GEC", "REC"])).then(
            pl.int_ranges(
                pl.col("cdl_type")
                .is_in(["REC", "GEC"])
                .rle()
                .struct.field("len")
            ).flatten()
        )
        .when(pl.col('cdl_type').is_null().and_(pl.col('cdl_type').shift(1).is_not_null()))
        .then(pl.lit(1))
        .otherwise(0)
    )
)

Expected output.

┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘

4 Answers

2

Based on the current approach and the expected result, I take it that the condition is that cdl_type equals either "REC" or "GEC".

The expected output can then be obtained as follows.

  • For each contiguous block of rows satisfying the condition, we obtain a corresponding id using pl.Expr.rle_id on the condition expression (illustrated in the sketch just below).
  • We use that id to create an increasing integer sequence for each block using pl.int_range.
  • Finally, we add 1, shift the sequence, and fill any missing values with 0.
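To illustrate the first step, here is a quick sketch of the run ids that rle_id assigns on the sample data:

df.select(
    pl.col("cdl_type"),
    block=pl.col("cdl_type").is_in(["REC", "GEC"]).rle_id(),
)
# block should be 0, 0, 0, 1, 1, 2, 3, 4, 4 for the sample data:
# each contiguous run of matching rows (and each run of nulls) gets its own id

Putting it all together: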
df.with_columns(
    pl.when(
        pl.col("cdl_type").is_not_null()
    ).then(
        pl.int_range(
            pl.len()
        ).over(
            pl.col("cdl_type").is_in(["REC", "GEC"]).rle_id()
        )
    ).add(1).shift().fill_null(0)
)
shape: (9, 3)
┌────────┬──────────┬─────────┐
│ before ┆ cdl_type ┆ literal │
│ ---    ┆ ---      ┆ ---     │
│ i64    ┆ str      ┆ i64     │
╞════════╪══════════╪═════════╡
│ 0      ┆ REC      ┆ 0       │
│ 0      ┆ REC      ┆ 1       │
│ 0      ┆ GEC      ┆ 2       │
│ 0      ┆ null     ┆ 3       │
│ 0      ┆ null     ┆ 0       │
│ 0      ┆ GEC      ┆ 0       │
│ 0      ┆ null     ┆ 1       │
│ 0      ┆ REC      ┆ 0       │
│ 0      ┆ GEC      ┆ 1       │
└────────┴──────────┴─────────┘

1 Comment

This solves my real-world problem, thank you so much for your help.
1

(disclaimer: I had written this up earlier and came to post it, then realized how similar it is to @Hericks' answer).

Since you want your counts to reset every time a contiguous group of 'REC' or 'GEC' is broken, you'll need to break this problem into 2 parts. Rewriting the logic as pseudo-code, you end up with:

  • premise: When the current row has a value of either "REC" or "GEC"...
  • then: Take the incremental row count.
  • otherwise: The incremental count should become 0

Once we finish that logic, we should be able to shift everything down one row to create your desired output.

import polars as pl
from polars import col

df = pl.DataFrame({
    "before": [0, 0, 0, 0, 0, 0, 0, 0, 0],
    "cdl_type": ["REC", "REC", "GEC", None, None, "GEC", None, "REC", "GEC"],
})

print(
    df
    .with_columns(
        after=(
            pl.when(col('cdl_type').is_in(['GEC', 'REC']))
            .then(
                pl.int_range(pl.len()).add(1)
                .over(
                    col('cdl_type').is_in(['REC', 'GEC'])
                    .rle_id()
                )
            )
            .otherwise(0)
            .shift(1, fill_value=0)
        )
    )
)
# shape: (9, 3)
# ┌────────┬──────────┬───────┐
# │ before ┆ cdl_type ┆ after │
# │ ---    ┆ ---      ┆ ---   │
# │ i64    ┆ str      ┆ u32   │
# ╞════════╪══════════╪═══════╡
# │ 0      ┆ REC      ┆ 0     │
# │ 0      ┆ REC      ┆ 1     │
# │ 0      ┆ GEC      ┆ 2     │
# │ 0      ┆ null     ┆ 3     │
# │ 0      ┆ null     ┆ 0     │
# │ 0      ┆ GEC      ┆ 0     │
# │ 0      ┆ null     ┆ 1     │
# │ 0      ┆ REC      ┆ 0     │
# │ 0      ┆ GEC      ┆ 1     │
# └────────┴──────────┴───────┘
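For reference, dropping the final .shift(1, fill_value=0) shows the 1-based per-run counts before they are shifted down one row (a quick sketch reusing the frame above):

print(
    df
    .with_columns(
        pre_shift=(
            pl.when(col('cdl_type').is_in(['GEC', 'REC']))
            .then(
                pl.int_range(pl.len()).add(1)
                .over(col('cdl_type').is_in(['REC', 'GEC']).rle_id())
            )
            .otherwise(0)
        )
    )
)
# pre_shift should be 1, 2, 3, 0, 0, 1, 0, 1, 2 for the sample data;
# shifting it down one row and filling the first row with 0 gives the "after" column above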

Comments

1

We will add an extra null row into the sample frame (to help verify the result).

df = pl.DataFrame({
    "before": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "cdl_type": ["REC", "REC", "GEC", None, None, None, "GEC", None, "REC", "GEC"],
})

(My initial answer produced the expected output for the sample provided, but was broken for None runs of length > 2)

It looks like you want to also include the first following null as part of the run-length - which could be done by forward filling 1 step.

df.with_columns(
    pl.when(pl.col("cdl_type").is_not_null())
      .then(pl.col("cdl_type").is_not_null().rle_id())
      .forward_fill(limit=1)
      .rle_id()
      .alias("rle_id")
)
shape: (10, 3)
┌────────┬──────────┬────────┐
│ before ┆ cdl_type ┆ rle_id │
│ ---    ┆ ---      ┆ ---    │
│ i64    ┆ str      ┆ u32    │
╞════════╪══════════╪════════╡
│ 0      ┆ REC      ┆ 0      │
│ 0      ┆ REC      ┆ 0      │
│ 0      ┆ GEC      ┆ 0      │
│ 0      ┆ null     ┆ 0      │ # <-
│ 0      ┆ null     ┆ 1      │
│ 0      ┆ null     ┆ 1      │
│ 0      ┆ GEC      ┆ 2      │
│ 0      ┆ null     ┆ 2      │ # <-
│ 0      ┆ REC      ┆ 3      │
│ 0      ┆ GEC      ┆ 3      │
└────────┴──────────┴────────┘

rle() gives a struct containing each run's {len, value} pair.

df.select(
    pl.when(pl.col("cdl_type").is_not_null())
      .then(pl.col("cdl_type").is_not_null().rle_id())
      .forward_fill(limit=1)
      .rle()
)
shape: (4, 1)
┌───────────┐
│ cdl_type  │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {4,0}     │
│ {2,null}  │
│ {2,2}     │
│ {2,4}     │
└───────────┘

The len values are given to int_ranges() and flattened to create the count column.

df.with_columns(
    pl.int_ranges(
        pl.when(pl.col("cdl_type").is_not_null())
          .then(pl.col("cdl_type").is_not_null().rle_id())
          .forward_fill(limit=1)
          .rle()
          .struct.field("len")
    )
    .flatten()
    .alias("after")
)
shape: (10, 3)
┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ null     ┆ 1     │ # <- NOT OK
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘

We then set rows whose previous cdl_type value is null (i.e. the extra nulls in longer null runs) back to 0.

df.with_columns(
    pl.int_ranges(
        pl.when(pl.col("cdl_type").is_not_null())
          .then(pl.col("cdl_type").is_not_null().rle_id())
          .forward_fill(limit=1)
          .rle()
          .struct.field("len")
    )
    .flatten()
    .alias("after")
).with_columns(
    pl.when(pl.col("cdl_type").shift().is_not_null())
      .then(pl.col("after"))
      .otherwise(0)
)
shape: (10, 3)
┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ null     ┆ 0     │ # <- OK
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘

The reason for this approach is that it avoids using .over(), which can produce much faster results on larger dataframes.
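As a rough sanity check (a sketch only; the frame size and random data here are made up, and timings will vary with hardware and Polars version), both expressions can be compared on a larger frame:

import random
import time

import polars as pl

random.seed(0)
big = pl.DataFrame({
    "cdl_type": random.choices(["REC", "GEC", None], k=5_000_000),
})

# expression using .over() (as in the other answers)
over_expr = (
    pl.when(pl.col("cdl_type").is_not_null())
      .then(
          pl.int_range(pl.len())
            .over(pl.col("cdl_type").is_in(["REC", "GEC"]).rle_id())
      )
      .add(1)
      .shift()
      .fill_null(0)
)

# expression using rle()/int_ranges() (this answer; the final pass that zeroes
# longer null runs is omitted here for brevity)
rle_expr = (
    pl.int_ranges(
        pl.when(pl.col("cdl_type").is_not_null())
          .then(pl.col("cdl_type").is_not_null().rle_id())
          .forward_fill(limit=1)
          .rle()
          .struct.field("len")
    )
    .flatten()
)

for name, expr in [("over", over_expr), ("rle", rle_expr)]:
    start = time.perf_counter()
    big.with_columns(expr.alias("after"))
    print(name, round(time.perf_counter() - start, 3))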

Comments

0
df.with_columns(
    after = pl.int_range(pl.len()).over(
        pl.col.cdl_type
        .is_in(["REC", "GEC"])
        # fill each null with a unique value >= 2 so every null row forms its own
        # run and never merges with a True/False run (booleans cast to 1/0 here)
        .fill_null(pl.int_range(pl.len()) + 2)
        .rle_id()
    )
    # 0-based position within each run; add 1 only on non-null rows,
    # then shift down one row and fill the first row with 0
    .add(~pl.col.cdl_type.is_null()).shift().fill_null(0)
)
shape: (9, 3)
┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘

4 Comments

This fails if cdl_type has 2 leading nulls, no? As the 1 from .fill_null(pl.int_range(pl.len())) would contribute to the first true condition block.
True, thanks for that. Looks like a simple +1 can fix that though (or pl.int_range(1, pl.len() + 1)).
I think a +2 would be needed. Otherwise, a single leading null leads to the same effect. +2 should guarantee that you'll never have a 1 in the int range that could contribute to a true condition block.
[facepalm] of course, because int_range starts from 0.
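A quick sketch with a made-up frame containing two leading nulls suggests the +2 version handles that edge case:

edge = pl.DataFrame({
    "cdl_type": [None, None, "REC", "GEC", None, "REC"],
})

edge.with_columns(
    after=pl.int_range(pl.len()).over(
        pl.col.cdl_type
        .is_in(["REC", "GEC"])
        .fill_null(pl.int_range(pl.len()) + 2)
        .rle_id()
    )
    .add(~pl.col.cdl_type.is_null()).shift().fill_null(0)
)
# expected "after": 0, 0, 0, 1, 2, 0
# the two leading nulls stay in separate runs and never merge with the first REC/GEC run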
