5

Say we have this dataframe:

import polars as pl

df = pl.DataFrame({'EU': {'size': 10, 'GDP': 80},
                   'US': {'size': 100, 'GDP': 800},
                   'AS': {'size': 80, 'GDP': 500}})
shape: (1, 3)
┌───────────┬───────────┬───────────┐
│ EU        ┆ US        ┆ AS        │
│ ---       ┆ ---       ┆ ---       │
│ struct[2] ┆ struct[2] ┆ struct[2] │
╞═══════════╪═══════════╪═══════════╡
│ {10,80}   ┆ {100,800} ┆ {80,500}  │
└───────────┴───────────┴───────────┘

I am looking for a function like df.expand_structs(column_name='metric') that gives

shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘

I've tried other functions like unnest and explode, but no luck. Any help appreciated!

5 Answers

5

TL;DR

Performance comparison at the end.

Both @etrotta's method and @DeanMacGregor's adjustment of it perform well on a pl.LazyFrame with small structs (e.g., struct[2]) and N <= 15 columns (not collected). The other methods fail on a LazyFrame.

With bigger structs and/or N > 15 columns, both unpivot options below start to outperform them. The other suggested methods are generally slower.
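To illustrate the lazy point: a LazyFrame has no .transpose() or .pivot(), so only expression-based approaches can stay lazy. A minimal sketch, reusing @etrotta's expressions (with countries and metrics as defined in that answer below):

lf = df.lazy()

# expression-only approaches stay lazy:
out_lazy = lf.select(
    pl.lit(metrics).alias("metrics"),
    *(pl.concat_list(
        pl.col(country).struct.field(metric)
        for metric in metrics
    ).alias(country) for country in countries),
).explode(pl.all())

# options 1 and 2 below cannot: LazyFrame has no .transpose() or
# .pivot(), so they need an eager DataFrame (or a .collect() first)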


Option 1

out = (df.unpivot()                     # wide to long: 'variable'/'value'
       .unnest('value')                 # struct fields to columns
       .select(pl.exclude('variable'))
       .transpose(include_header=True)  # header column holds field names
       .pipe(
           lambda x: x.rename(
               dict(zip(x.columns, ['metric'] + df.columns))
               )
           )
       )

Output:

shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘

Explanation / Intermediates

  • Start with df.unpivot to turn the columns into rows:
shape: (3, 2)
┌──────────┬───────────┐
│ variable ┆ value     │
│ ---      ┆ ---       │
│ str      ┆ struct[2] │
╞══════════╪═══════════╡
│ EU       ┆ {10,80}   │
│ US       ┆ {100,800} │
│ AS       ┆ {80,500}  │
└──────────┴───────────┘
  • So that we can apply df.unnest to the new 'value' column:
shape: (3, 3)
┌──────────┬──────┬─────┐
│ variable ┆ size ┆ GDP │
│ ---      ┆ ---  ┆ --- │
│ str      ┆ i64  ┆ i64 │
╞══════════╪══════╪═════╡
│ EU       ┆ 10   ┆ 80  │
│ US       ┆ 100  ┆ 800 │
│ AS       ┆ 80   ┆ 500 │
└──────────┴──────┴─────┘
  • Drop 'variable' and df.transpose with include_header=True:
shape: (2, 4)
┌────────┬──────────┬──────────┬──────────┐
│ column ┆ column_0 ┆ column_1 ┆ column_2 │
│ ---    ┆ ---      ┆ ---      ┆ ---      │
│ str    ┆ i64      ┆ i64      ┆ i64      │
╞════════╪══════════╪══════════╪══════════╡
│ size   ┆ 10       ┆ 100      ┆ 80       │
│ GDP    ┆ 80       ┆ 800      ┆ 500      │
└────────┴──────────┴──────────┴──────────┘
  • Now we just need to rename the columns, here done via df.pipe + df.rename. Without the chained operation, that can also be:
out.columns = ['metric'] + df.columns

Option 2

out2 = (df.unpivot()
        .unnest('value')
        .unpivot(index='variable', variable_name='metric')
        .pivot(on='variable', index='metric')
        )

Equality check:

out.equals(out2)
# True

Explanation / Intermediates

  • Same start as option 1, but followed by a second df.unpivot to get:
shape: (6, 3)
┌──────────┬────────┬───────┐
│ variable ┆ metric ┆ value │
│ ---      ┆ ---    ┆ ---   │
│ str      ┆ str    ┆ i64   │
╞══════════╪════════╪═══════╡
│ EU       ┆ size   ┆ 10    │
│ US       ┆ size   ┆ 100   │
│ AS       ┆ size   ┆ 80    │
│ EU       ┆ GDP    ┆ 80    │
│ US       ┆ GDP    ┆ 800   │
│ AS       ┆ GDP    ┆ 500   │
└──────────┴────────┴───────┘
  • Followed by df.pivot on 'variable' with 'metric' as the index to get the desired shape.

Performance comparison (gist)

Number of columns: n_range=[2**k for k in range(12)]

Struct sizes: 2, 20, 100 (i.e., struct[2], struct[20], struct[100])

Methods compared:

  • unpivot_unnest_t (option 1), #@ouroboros1
  • unpivot_unnest_t2 (option 1, adj)
  • unpivot_pivot (option 2)
  • concat_list_expl, #@etrotta
  • concat_list_expl_lazy, #lazy
  • concat_list_expl2, #@etrotta, #@DeanMacGregor
  • concat_list_expl2_lazy, #lazy
  • map_batches, #@DeanMacGregor
  • loop, #@sammywemmy

Results: benchmark plots for struct[2], struct[20], and struct[100] (see the linked gist).
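For reference, the benchmark follows roughly this pattern (a minimal sketch assuming perfplot; make_df and the single kernel shown here are illustrative, the gist has the full setup):

import perfplot
import polars as pl

def make_df(n_cols: int) -> pl.DataFrame:
    # 1-row frame with n_cols struct[2] columns, mirroring the OP's data
    return pl.DataFrame(
        {f"c{i}": {"size": 10, "GDP": 80} for i in range(n_cols)}
    )

def unpivot_unnest_t(df: pl.DataFrame) -> pl.DataFrame:
    # option 1 from above
    out = (df.unpivot()
           .unnest("value")
           .select(pl.exclude("variable"))
           .transpose(include_header=True))
    out.columns = ["metric"] + df.columns
    return out

perfplot.show(
    setup=make_df,
    kernels=[unpivot_unnest_t],  # plus the other methods compared
    labels=["unpivot_unnest_t"],
    n_range=[2**k for k in range(12)],
    equality_check=None,  # outputs differ across methods (column order)
)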


3 Comments

Not sure if you can refactor the pivot but you can start with df.unpivot() instead of transpose.
@jqurious: thanks, cleaner indeed. Added another option with .select + .transpose.
Great job @ouroboros1. Am I right to conclude that the test is only width wise (increasing number of columns) and not length wise (increasing number of rows)? Basically a single row with lots of structs and lots of columns?
3

Working with structs typically gets a bit awkward when you have multiple columns with the same fields, so I would first turn them into lists, then explode:

schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']

# one list column per country holding that country's field values,
# plus a list of the field names; then explode everything to rows
df.select(
    pl.lit(metrics).alias("metrics"),
    *(pl.concat_list(
        pl.col(country).struct.field(metric)
        for metric in metrics
    ).alias(country) for country in countries),
).explode(pl.all())

2 Comments

minor tweak, instead of lit do pl.Series("metrics", metrics), and instead of using .explode(pl.all()) on the df, put the explode before the alias at the Expr level (see the sketch after these comments). This way it only does 3 explodes instead of 4, and it can do them in parallel. Looking at a profile of the two shows a big difference, although I'm not sure what units profile uses.
Also very nice about this method is that it works on a pl.LazyFrame. @DeanMacGregor: added performance comparison to my answer, also your variant here.
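The adjustment described in the first comment would look roughly like this (a sketch, not @DeanMacGregor's verbatim code; it reuses schema, countries, and metrics from the answer above):

out = df.select(
    pl.Series("metrics", metrics),  # plain Series instead of pl.lit
    *(
        pl.concat_list(
            pl.col(country).struct.field(metric)
            for metric in metrics
        )
        .explode()  # explode per expression instead of on the whole frame
        .alias(country)
        for country in countries
    ),
)

Because everything stays at the expression level, this variant also runs on a LazyFrame (the concat_list_expl2_lazy entry in the benchmark above).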
1

A variation of the answer from @etrotta, without exploding.

schema = df.collect_schema()
countries = schema.names()
metrics = list(schema[countries[0]].to_schema())

metric = pl.concat(
    pl.repeat(metric, pl.len()).alias("metric")
    for metric in metrics
)

values = [
    pl.concat(
        pl.col(country).struct.field(metric) for metric in metrics
    ).alias(country)
    for country in countries
]

df.select(metric, *values)
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘


0

I think etrotta's method will be more efficient, but here's a way that is syntactically shorter:

df.select(
    pl.Series('metric', (metrics := [x.name for x in df.dtypes[0].fields])),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(metrics).explode())
        .to_series().alias(s.name)
    ))
)

Note the walrus operator in the Series and the subsequent reuse of metrics in concat_list. If you're confident that the fields will be in the same order in each of your structs, then you could forgo the walrus and just use pl.all() inside the concat_list.

Alternatively, if you don't like referring to the df inside its own context, then you could create the metrics column this way, which assumes all the structs' fields are in the same order:

df.select(
    pl.first().map_batches(lambda s: pl.Series(s.struct.fields)).alias('metrics'),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(pl.all()).explode())
        .to_series().alias(s.name)
    ))
)
shape: (2, 4)
┌─────────┬─────┬─────┬─────┐
│ metrics ┆ EU  ┆ US  ┆ AS  │
│ ---     ┆ --- ┆ --- ┆ --- │
│ str     ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╪═════╡
│ size    ┆ 10  ┆ 100 ┆ 80  │
│ GDP     ┆ 80  ┆ 800 ┆ 500 │
└─────────┴─────┴─────┴─────┘


0

Speed-wise (and maybe simplicity-wise), I would suggest using a for loop to create the individual Series and then building a new DataFrame from them. This approach is faster than @etrotta's excellent work:

import polars as pl

# reusing @etrotta's work:
schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']


# build a dictionary of Series
# and subsequently create a new DataFrame
mapping = {}
for country in countries:
    array = []
    for metric in metrics:
        series = df.get_column(country).struct.field(metric)
        array.append(series)
    mapping[country] = pl.concat(array)

# build the 'metrics' labels by repeating each field name;
# if you are not opposed to using another library, numpy.repeat
# would fit in nicely here and should offer good perf as well
array = []
for metric in metrics:
    array.append(pl.repeat(metric,n=len(df),eager=True))
mapping['metrics']=pl.concat(array)
pl.DataFrame(mapping)

shape: (2, 4)
┌─────┬─────┬─────┬─────────┐
│ EU  ┆ US  ┆ AS  ┆ metrics │
│ --- ┆ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64 ┆ str     │
╞═════╪═════╪═════╪═════════╡
│ 10  ┆ 100 ┆ 80  ┆ size    │
│ 80  ┆ 800 ┆ 500 ┆ GDP     │
└─────┴─────┴─────┴─────────┘

Of course, the speed tests are based on your shared data; would it still be performant for a large number of columns (width, not length, now being the controlling factor)?

NB: If the field names could be accessed directly within a context, that would probably offer even more performance, as everything would occur within the Polars framework.

3 Comments

added performance comparison to my answer. @etrotta's answer is nice because it can be done lazily; none of the others can. Otherwise, unpivot gives good performance, if my calculations are correct (gist available for review).
@ouroboros1 great job. Do you mind extending your tests to an increasing number of rows. Your current approach tests multiple structs with multiple columns, but for a single row. Curious to see if there is a change in performance
I would, but the solutions do not scale similarly with n rows > 1. My option 1 goes wide, option 2 breaks; so does Dean's; yours & jqurious' go long (duplicated metrics), so does etrotta's but alternating. Neither seems the obvious desired output (and the OP did not ask, so who's to say). E.g., df = pl.DataFrame({'A': [{'a': 1, 'b': 2},{'a': 3, 'b': 4}]}) with jqurious' method gives: pl.DataFrame({'metric': ['a', 'a', 'b', 'b'], 'A': [1, 3, 2, 4]}). Mine: pl.DataFrame({'metric': ['a', 'b'], 'A': [1, 2], 'column_1': [3, 4]}), which would require a column-label tweak. A.1, A.2? Which one would be "correct"?
