I've written a custom function in Polars that generates a list of expressions for a horizontal forward/backward fill. The function accepts an iterable of expressions (or column names) that determines the fill order. I want to use all columns via pl.all() as the default. The problem is that pl.all() returns a single expression rather than an iterable, so trying to reverse or iterate over it raises a TypeError.

Is there a way to convert between single expressions and iterables of expressions? Any suggestions or workarounds are greatly appreciated!

Here is the function:

from typing import Iterable
from polars._typing import IntoExpr
import polars as pl

def fill_horizontal(exprs: Iterable[IntoExpr], forward: bool = True) -> list[pl.Expr]:
    """Generate a horizontal forward/backward fill list of expressions."""
    # exprs = exprs or pl.all()  # use all columns as default
    cols = list(exprs)  # materialize so any iterable can be reversed and sliced
    if forward:
        cols.reverse()
    return [pl.coalesce(cols[i:]) for i in range(len(cols) - 1)]

Here is an example:

df = pl.DataFrame({
    "col1": [1, None, 2],
    "col2": [1, 2, None],
    "col3": [None, None, 3]})
print(df)
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ col1 ┆ col2 ┆ col3 │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ i64  │
# ╞══════╪══════╪══════╡
# │ 1    ┆ 1    ┆ null │
# │ null ┆ 2    ┆ null │
# │ 2    ┆ null ┆ 3    │
# └──────┴──────┴──────┘
print('forward_fill')
print(df.with_columns(fill_horizontal(df.columns, forward=True)))
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ col1 ┆ col2 ┆ col3 │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ i64  │
# ╞══════╪══════╪══════╡
# │ 1    ┆ 1    ┆ 1    │
# │ null ┆ 2    ┆ 2    │
# │ 2    ┆ 2    ┆ 3    │
# └──────┴──────┴──────┘
print('backward_fill')
print(df.with_columns(fill_horizontal(df.columns, forward=False)))
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ col1 ┆ col2 ┆ col3 │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ i64  │
# ╞══════╪══════╪══════╡
# │ 1    ┆ 1    ┆ null │
# │ 2    ┆ 2    ┆ null │
# │ 2    ┆ 3    ┆ 3    │
# └──────┴──────┴──────┘

Edit: Combining @Henry Harbeck's answer with @jqurious's comment gives a solution that is not perfect but sufficient for now.

def fill_horizontal(
        exprs: Iterable[IntoExpr] | None = None,
        *,
        forward: bool = True,
        ncols: int = 1000) -> pl.Expr:
    """Generate a horizontal forward/backward fill expression."""
    if exprs is None:
        # If forward is False, ncols must be at least the number of columns in the frame.
        cols = pl.all() if forward else pl.nth(range(ncols, -1, -1))
    else:
        cols = exprs if forward else reversed(exprs)
    return pl.cum_reduce(lambda s1, s2: pl.coalesce(s2, s1), cols).struct.unnest()
  • Am I correct that you want to be able to use your functions but with pl.all() instead of df.columns? Sorry for being dense I'm on mobile. Commented Feb 22 at 16:56
  • I want to be able to provide specific columns or no columns at all as a function argument. If no columns are provided, all columns should be used by default (i.e., using pl.all()). To achieve this, pl.all() needs to return a list of expressions rather than a single multi-column expression. I'd also like to use pl.all() as function argument, too. Commented Feb 22 at 17:04
  • @DeanMacGregor I think they want a default argument, so that df.with_columns(fill_horizontal()) can be called. In particular, they want a generic expression that iterates over all columns available in the enclosing context. Commented Feb 22 at 17:04
  • It looks like you want to reverse the order of pl.all(), which I don't think is currently possible with just expressions. The closest is nth, but it requires you to specify the number of columns. You could use some "huge" number, pl.nth(range(100_000, -1, -1)), but it would break if the frame had more columns. Commented Feb 22 at 17:11
  • @jqurious Even if no reversal were required, pl.all() couldn't be used, as one cannot do pl.coalesce(...) for i in len(pl.all()), no? Commented Feb 22 at 17:13

1 Answer

Check out cum_reduce, which does a cumulative horizontal reduction. This is pretty much what you are after and saves you having to do any Python looping.

Unfortunately, it reduces from left to right only. I've made this feature request to ask for right to left reductions, which should fully enable your use-case.
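To make the direction issue concrete, here is a pure-Python analogy rather than Polars code: itertools.accumulate plays the role of a left-to-right cumulative reduction over the values of one row, and the helper name fill_row is made up for this illustration. A right-to-left fill is only possible here because a plain list can be reversed before and after the reduction, which is exactly what a multi-column expression like pl.all() does not allow.

```python
from itertools import accumulate


def fill_row(row, *, forward=True):
    """Fill None values in a single row (hypothetical helper, plain Python)."""

    def keep_latest(acc, value):
        # Same idea as coalesce(s2, s1): prefer the new value, else carry the old one.
        return value if value is not None else acc

    if forward:
        # Left-to-right cumulative reduction.
        return list(accumulate(row, keep_latest))
    # Right-to-left: reverse the input, reduce, then restore the original order.
    return list(accumulate(reversed(row), keep_latest))[::-1]


# Rows of the example frame from the question (col1, col2, col3).
rows = [[1, 1, None], [None, 2, None], [2, None, 3]]
forward_filled = [fill_row(r) for r in rows]
backward_filled = [fill_row(r, forward=False) for r in rows]
```

With the rows of the example frame, forward_filled and backward_filled match the forward_fill and backward_fill tables shown in the question.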

Here's a tweaked version of your function that works in all cases except the combination of pl.all() and forward=False:

def fill_horizontal(
    exprs: Iterable[IntoExpr] | None = None,
    *,
    forward: bool = True
) -> pl.Expr:
    """Generate a horizontal forward/backward fill expression."""
    exprs = exprs or [pl.all()]  # use all columns as default
    # reversed() has no effect on [pl.all()]: the columns keep their original order
    cols = exprs if forward else reversed(exprs)
    return pl.cum_reduce(lambda s1, s2: pl.coalesce(s2, s1), cols).struct.unnest()

df.with_columns(fill_horizontal())
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ col1 ┆ col2 ┆ col3 │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ i64  │
# ╞══════╪══════╪══════╡
# │ 1    ┆ 1    ┆ 1    │
# │ null ┆ 2    ┆ 2    │
# │ 2    ┆ 2    ┆ 3    │
# └──────┴──────┴──────┘

# Doesn't work :( reversing [pl.all()] leaves the column order unchanged, so this still fills forward
df.with_columns(fill_horizontal(forward=False))

# Works as a backward fill
df.with_columns(fill_horizontal(df.columns, forward=False))

Other options I can think of are:

  • make this a DataFrame / LazyFrame level function. You can pipe the frame in and access the schema directly. You can then reverse the columns without needing to expose this to the caller. This may block some optimisations / lazy evaluation
  • make a feature request to reverse the column order of multi-column expressions such as pl.all() and pl.col("a", "b")
  • upvote the feature request I've linked, and force the caller of your function to use df.columns until it hopefully gets implemented

1 Comment

Thanks, @Henry Harbeck. That’s a great answer! For exprs=None, we could use pl.nth(range(..., -1, -1)) as a workaround for backward fill while pl.col_reversed('*') is not available. It’s not the prettiest or safest approach, but it gets the job done.
