Expand/Unnest Polars struct into rows, not into columns

Question

I have this DataFrame

import polars as pl

df = pl.DataFrame({
    'as_of':    ['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04'],
    'quantity': [{'A': 10, 'B': 5}, {'A': 11, 'B': 7}, {'A': 9, 'B': 4, 'C': -3},
                 {'A': 15, 'B': 3, 'C': -14, 'D': 50}]
}, schema={'as_of': pl.String, 'quantity': pl.Struct})

shape: (4, 2)
┌────────────┬──────────────────┐
│ as_of      ┆ quantity         │
│ ---        ┆ ---              │
│ str        ┆ struct[4]        │
╞════════════╪══════════════════╡
│ 2024-08-01 ┆ {10,5,null,null} │
│ 2024-08-02 ┆ {11,7,null,null} │
│ 2024-08-03 ┆ {9,4,-3,null}    │
│ 2024-08-04 ┆ {15,3,-14,50}    │
└────────────┴──────────────────┘

If I unnest the struct column

df.unnest('quantity')

I get one column per field:

shape: (4, 5)
┌────────────┬─────┬─────┬──────┬──────┐
│ as_of      ┆ A   ┆ B   ┆ C    ┆ D    │
│ ---        ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ str        ┆ i64 ┆ i64 ┆ i64  ┆ i64  │
╞════════════╪═════╪═════╪══════╪══════╡
│ 2024-08-01 ┆ 10  ┆ 5   ┆ null ┆ null │
│ 2024-08-02 ┆ 11  ┆ 7   ┆ null ┆ null │
│ 2024-08-03 ┆ 9   ┆ 4   ┆ -3   ┆ null │
│ 2024-08-04 ┆ 15  ┆ 3   ┆ -14  ┆ 50   │
└────────────┴─────┴─────┴──────┴──────┘

Instead of unnesting each field into its own column, can I unnest into rows to get a DataFrame like this?

shape: (11, 3)
┌────────────┬──────┬──────────┐
│ as_of      ┆ name ┆ quantity │
│ ---        ┆ ---  ┆ ---      │
│ str        ┆ str  ┆ i64      │
╞════════════╪══════╪══════════╡
│ 2024-08-01 ┆ A    ┆ 10       │
│ 2024-08-01 ┆ B    ┆ 5        │
│ 2024-08-02 ┆ A    ┆ 11       │
│ 2024-08-02 ┆ B    ┆ 7        │
│ 2024-08-03 ┆ A    ┆ 9        │
│ …          ┆ …    ┆ …        │
│ 2024-08-03 ┆ C    ┆ -3       │
│ 2024-08-04 ┆ A    ┆ 15       │
│ 2024-08-04 ┆ B    ┆ 3        │
│ 2024-08-04 ┆ C    ┆ -14      │
│ 2024-08-04 ┆ D    ┆ 50       │
└────────────┴──────┴──────────┘

Dean MacGregor · Accepted Answer · 2024-08-12 17:43:01Z

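You can't do it in one step, but what you're after is an unpivot (called melt before Polars 1.0). Unnest the struct into columns, unpivot those columns into rows, then filter out the nulls that stand in for fields a row did not have.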

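(
    df
    # one column per struct field: A, B, C, D
    .unnest('quantity')
    # turn the field columns into (name, quantity) rows
    .unpivot(
        index='as_of',
        variable_name='name',
        value_name='quantity',
    )
    # fields missing from a row come through as null
    .filter(pl.col('quantity').is_not_null())
    # sorting on 'name' as well keeps the order within each date deterministic
    .sort('as_of', 'name')
)

shape: (11, 3)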
┌────────────┬──────┬──────────┐
│ as_of      ┆ name ┆ quantity │
│ ---        ┆ ---  ┆ ---      │
│ str        ┆ str  ┆ i64      │
╞════════════╪══════╪══════════╡
│ 2024-08-01 ┆ A    ┆ 10       │
│ 2024-08-01 ┆ B    ┆ 5        │
│ 2024-08-02 ┆ A    ┆ 11       │
│ 2024-08-02 ┆ B    ┆ 7        │
│ 2024-08-03 ┆ A    ┆ 9        │
│ …          ┆ …    ┆ …        │
│ 2024-08-03 ┆ C    ┆ -3       │
│ 2024-08-04 ┆ A    ┆ 15       │
│ 2024-08-04 ┆ B    ┆ 3        │
│ 2024-08-04 ┆ C    ┆ -14      │
│ 2024-08-04 ┆ D    ┆ 50       │
└────────────┴──────┴──────────┘

Cameron Riddell · Accepted Answer · 2024-08-13 13:50:59Z

If you have a LazyFrame and want to avoid materializing your data, you can skip unpivot and rely on the schema instead, since Polars already knows the fields of a struct. Build one small frame per field and concatenate them vertically. A side benefit (not required) is that the resulting 'name' column can easily be cast to an Enum instead of a string. Note that the rows come out grouped by field rather than by date.

import polars as pl

df = pl.DataFrame({
    'as_of':    ['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04'],
    'quantity': [{'A': 10, 'B': 5}, {'A': 11, 'B': 7}, {'A': 9, 'B': 4, 'C': -3},
                 {'A': 15, 'B': 3, 'C': -14, 'D': 50}]
}, schema={'as_of': pl.String, 'quantity': pl.Struct}).lazy()

# Casting to Enum is not necessary
NameDType = pl.Enum([field for field, _ in df.schema['quantity']])

result = (
    pl.concat(
        items=[
            df.select(
                pl.all().exclude('quantity'),
                pl.lit(field).cast(NameDType).alias('name'),
                pl.col('quantity').struct.field(field).alias('quantity'),
            )
            .drop_nulls(subset='quantity')
            for field, _ in df.schema['quantity']
        ],
        how='vertical',
    )
)

print(result.collect())
# shape: (11, 3)
# ┌────────────┬──────┬──────────┐
# │ as_of      ┆ name ┆ quantity │
# │ ---        ┆ ---  ┆ ---      │
# │ str        ┆ enum ┆ i64      │
# ╞════════════╪══════╪══════════╡
# │ 2024-08-01 ┆ A    ┆ 10       │
# │ 2024-08-02 ┆ A    ┆ 11       │
# │ 2024-08-03 ┆ A    ┆ 9        │
# │ 2024-08-04 ┆ A    ┆ 15       │
# │ 2024-08-01 ┆ B    ┆ 5        │
# │ …          ┆ …    ┆ …        │
# │ 2024-08-03 ┆ B    ┆ 4        │
# │ 2024-08-04 ┆ B    ┆ 3        │
# │ 2024-08-03 ┆ C    ┆ -3       │
# │ 2024-08-04 ┆ C    ┆ -14      │
# │ 2024-08-04 ┆ D    ┆ 50       │
# └────────────┴──────┴──────────┘

Where categories and NameDType are derived from the struct's schema:

import polars as pl

df = pl.DataFrame({
    'as_of':    ['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04'],
    'quantity': [{'A': 10, 'B': 5}, {'A': 11, 'B': 7}, {'A': 9, 'B': 4, 'C': -3},
                 {'A': 15, 'B': 3, 'C': -14, 'D': 50}]
}, schema={'as_of': pl.String, 'quantity': pl.Struct}).lazy()

categories = [field for field, _ in df.schema['quantity']]
NameDType = pl.Enum(categories)
print(
    categories,
    NameDType,
    sep='\n'
)
