We will add an extra null row into the sample frame (to help verify the result).
import polars as pl

df = pl.DataFrame({
    "before": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "cdl_type": ["REC", "REC", "GEC", None, None, None, "GEC", None, "REC", "GEC"],
})
(My initial answer produced the expected output for the sample provided, but it broke for runs of more than 2 consecutive nulls.)
It looks like you also want the first null after each run of values to count as part of that run, which can be done by forward filling with a limit of 1 step.
df.with_columns(
    pl.when(pl.col("cdl_type").is_not_null())
      .then(pl.col("cdl_type").is_not_null().rle_id())
      .forward_fill(limit=1)
      .rle_id()
      .alias("rle_id")
)
shape: (10, 3)
┌────────┬──────────┬────────┐
│ before ┆ cdl_type ┆ rle_id │
│ ---    ┆ ---      ┆ ---    │
│ i64    ┆ str      ┆ u32    │
╞════════╪══════════╪════════╡
│ 0      ┆ REC      ┆ 0      │
│ 0      ┆ REC      ┆ 0      │
│ 0      ┆ GEC      ┆ 0      │
│ 0      ┆ null     ┆ 0      │ # <-
│ 0      ┆ null     ┆ 1      │
│ 0      ┆ null     ┆ 1      │
│ 0      ┆ GEC      ┆ 2      │
│ 0      ┆ null     ┆ 2      │ # <-
│ 0      ┆ REC      ┆ 3      │
│ 0      ┆ GEC      ┆ 3      │
└────────┴──────────┴────────┘
rle() returns one struct per run, holding that run's {len,value}
df.select(
    pl.when(pl.col("cdl_type").is_not_null())
      .then(pl.col("cdl_type").is_not_null().rle_id())
      .forward_fill(limit=1)
      .rle()
)
shape: (4, 1)
┌───────────┐
│ cdl_type  │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {4,0}     │
│ {2,null}  │
│ {2,2}     │
│ {2,4}     │
└───────────┘
The len values are passed to int_ranges(), which builds a range from 0 to len - 1 for each run, and the ranges are flattened to create the count column. Rows inside a longer null run still get a running count at this point, which is not what we want (marked NOT OK below).
df.with_columns(
    pl.int_ranges(
        pl.when(pl.col("cdl_type").is_not_null())
          .then(pl.col("cdl_type").is_not_null().rle_id())
          .forward_fill(limit=1)
          .rle()
          .struct.field("len")
    )
    .flatten()
    .alias("after")
)
shape: (10, 3)
┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ null     ┆ 1     │ # <- NOT OK
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘
We then reset the count to 0 on every row whose previous cdl_type is null, which zeroes the rest of each null run
df.with_columns(
    pl.int_ranges(
        pl.when(pl.col("cdl_type").is_not_null())
          .then(pl.col("cdl_type").is_not_null().rle_id())
          .forward_fill(limit=1)
          .rle()
          .struct.field("len")
    )
    .flatten()
    .alias("after")
).with_columns(
    pl.when(pl.col("cdl_type").shift().is_not_null())
      .then(pl.col("after"))
      .otherwise(0)
)
shape: (10, 3)
┌────────┬──────────┬───────┐
│ before ┆ cdl_type ┆ after │
│ ---    ┆ ---      ┆ ---   │
│ i64    ┆ str      ┆ i64   │
╞════════╪══════════╪═══════╡
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ REC      ┆ 1     │
│ 0      ┆ GEC      ┆ 2     │
│ 0      ┆ null     ┆ 3     │
│ 0      ┆ null     ┆ 0     │
│ 0      ┆ null     ┆ 0     │ # <- OK
│ 0      ┆ GEC      ┆ 0     │
│ 0      ┆ null     ┆ 1     │
│ 0      ┆ REC      ┆ 0     │
│ 0      ┆ GEC      ┆ 1     │
└────────┴──────────┴───────┘
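To double-check the result on other inputs, a plain Python loop can serve as a reference. This is only a sketch: the rule it encodes (add 1 while the previous value is non-null, otherwise reset to 0) is my reading of the sample output, and `reference_counts` is a made-up helper name.

```python
def reference_counts(values):
    """Count up by 1 while the previous value is non-null, otherwise reset to 0."""
    counts = []
    for i in range(len(values)):
        if i > 0 and values[i - 1] is not None:
            counts.append(counts[-1] + 1)
        else:
            # First row, or the previous value was null.
            counts.append(0)
    return counts


cdl_type = ["REC", "REC", "GEC", None, None, None, "GEC", None, "REC", "GEC"]
print(reference_counts(cdl_type))
# [0, 1, 2, 3, 0, 0, 0, 1, 0, 1]
```

Comparing this list against the "after" column (for example via `.to_list()`) is a quick way to test longer null runs.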
The reason for this approach is that it avoids using .over(), which makes it much faster on larger dataframes.