I have this DataFrame:

import polars as pl

df = pl.DataFrame({
    'as_of':    ['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04'],
    'quantity': [{'A': 10, 'B': 5}, {'A': 11, 'B': 7}, {'A': 9, 'B': 4, 'C': -3},
                 {'A': 15, 'B': 3, 'C': -14, 'D': 50}]
}, schema={'as_of': pl.String, 'quantity': pl.Struct})
shape: (4, 2)
┌────────────┬──────────────────┐
│ as_of      ┆ quantity         │
│ ---        ┆ ---              │
│ str        ┆ struct[4]        │
╞════════════╪══════════════════╡
│ 2024-08-01 ┆ {10,5,null,null} │
│ 2024-08-02 ┆ {11,7,null,null} │
│ 2024-08-03 ┆ {9,4,-3,null}    │
│ 2024-08-04 ┆ {15,3,-14,50}    │
└────────────┴──────────────────┘

If I unnest the struct column:

df.unnest('quantity')

I get one column per field:

shape: (4, 5)
┌────────────┬─────┬─────┬──────┬──────┐
│ as_of      ┆ A   ┆ B   ┆ C    ┆ D    │
│ ---        ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ str        ┆ i64 ┆ i64 ┆ i64  ┆ i64  │
╞════════════╪═════╪═════╪══════╪══════╡
│ 2024-08-01 ┆ 10  ┆ 5   ┆ null ┆ null │
│ 2024-08-02 ┆ 11  ┆ 7   ┆ null ┆ null │
│ 2024-08-03 ┆ 9   ┆ 4   ┆ -3   ┆ null │
│ 2024-08-04 ┆ 15  ┆ 3   ┆ -14  ┆ 50   │
└────────────┴─────┴─────┴──────┴──────┘

Instead of unnesting each field into its own column, can I unnest into rows to get a DataFrame like this?

shape: (11, 3)
┌────────────┬──────┬──────────┐
│ as_of      ┆ name ┆ quantity │
│ ---        ┆ ---  ┆ ---      │
│ str        ┆ str  ┆ i64      │
╞════════════╪══════╪══════════╡
│ 2024-08-01 ┆ A    ┆ 10       │
│ 2024-08-01 ┆ B    ┆ 5        │
│ 2024-08-02 ┆ A    ┆ 11       │
│ 2024-08-02 ┆ B    ┆ 7        │
│ 2024-08-03 ┆ A    ┆ 9        │
│ …          ┆ …    ┆ …        │
│ 2024-08-03 ┆ C    ┆ -3       │
│ 2024-08-04 ┆ A    ┆ 15       │
│ 2024-08-04 ┆ B    ┆ 3        │
│ 2024-08-04 ┆ C    ┆ -14      │
│ 2024-08-04 ┆ D    ┆ 50       │
└────────────┴──────┴──────────┘

2 Answers

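You can't do it in one step, but what you're after is an unpivot (called melt before Polars 1.0): unnest the struct into columns, unpivot those columns into rows, then drop the nulls.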

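(
    df
    .unnest('quantity')                        # one column per struct field: A, B, C, D
    .unpivot(
        index='as_of',
        variable_name='name',
        value_name='quantity',
    )
    .filter(pl.col('quantity').is_not_null())  # drop fields a given row did not have
    .sort('as_of', 'name')                     # 'name' breaks ties so the row order is deterministic
)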
shape: (11, 3)
┌────────────┬──────┬──────────┐
│ as_of      ┆ name ┆ quantity │
│ ---        ┆ ---  ┆ ---      │
│ str        ┆ str  ┆ i64      │
╞════════════╪══════╪══════════╡
│ 2024-08-01 ┆ A    ┆ 10       │
│ 2024-08-01 ┆ B    ┆ 5        │
│ 2024-08-02 ┆ A    ┆ 11       │
│ 2024-08-02 ┆ B    ┆ 7        │
│ 2024-08-03 ┆ A    ┆ 9        │
│ …          ┆ …    ┆ …        │
│ 2024-08-03 ┆ C    ┆ -3       │
│ 2024-08-04 ┆ A    ┆ 15       │
│ 2024-08-04 ┆ B    ┆ 3        │
│ 2024-08-04 ┆ C    ┆ -14      │
│ 2024-08-04 ┆ D    ┆ 50       │
└────────────┴──────┴──────────┘

If you have a LazyFrame and want to avoid materializing your data, you can skip unpivot and rely on the schema instead, since Polars knows the fields of a struct column without collecting it. A side benefit (not required) is that the resulting 'name' column can easily be made an Enum instead of a string.

import polars as pl

df = pl.DataFrame({
    'as_of':    ['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04'],
    'quantity': [{'A': 10, 'B': 5}, {'A': 11, 'B': 7}, {'A': 9, 'B': 4, 'C': -3},
                 {'A': 15, 'B': 3, 'C': -14, 'D': 50}]
}, schema={'as_of': pl.String, 'quantity': pl.Struct}).lazy()

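# Field names of the struct, read from the lazy schema without collecting any data
# (collect_schema() avoids the warning that LazyFrame.schema emits in Polars 1.x)
fields = [field for field, _ in df.collect_schema()['quantity']]

# Casting to Enum is not necessary
NameDType = pl.Enum(fields)

# One lazy query per struct field, stacked vertically
result = pl.concat(
    items=[
        df.select(
            pl.all().exclude('quantity'),
            pl.lit(field).cast(NameDType).alias('name'),
            pl.col('quantity').struct.field(field).alias('quantity'),
        )
        .drop_nulls(subset='quantity')
        for field in fields
    ],
    how='vertical',
)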

print(result.collect())
# shape: (11, 3)
# ┌────────────┬──────┬──────────┐
# │ as_of      ┆ name ┆ quantity │
# │ ---        ┆ ---  ┆ ---      │
# │ str        ┆ enum ┆ i64      │
# ╞════════════╪══════╪══════════╡
# │ 2024-08-01 ┆ A    ┆ 10       │
# │ 2024-08-02 ┆ A    ┆ 11       │
# │ 2024-08-03 ┆ A    ┆ 9        │
# │ 2024-08-04 ┆ A    ┆ 15       │
# │ 2024-08-01 ┆ B    ┆ 5        │
# │ …          ┆ …    ┆ …        │
# │ 2024-08-03 ┆ B    ┆ 4        │
# │ 2024-08-04 ┆ B    ┆ 3        │
# │ 2024-08-03 ┆ C    ┆ -3       │
# │ 2024-08-04 ┆ C    ┆ -14      │
# │ 2024-08-04 ┆ D    ┆ 50       │
# └────────────┴──────┴──────────┘

Here the Enum categories are simply the struct's field names, read from the schema of the same lazy df as above:

categories = [field for field, _ in df.collect_schema()['quantity']]
NameDType = pl.Enum(categories)
print(
    categories,
    NameDType,
    sep='\n'
)
