5

Say we have this dataframe:

import polars as pl

df = pl.DataFrame({'EU': {'size': 10, 'GDP': 80},
                   'US': {'size': 100, 'GDP': 800},
                   'AS': {'size': 80, 'GDP': 500}})
shape: (1, 3)
┌───────────┬───────────┬───────────┐
│ EU        ┆ US        ┆ AS        │
│ ---       ┆ ---       ┆ ---       │
│ struct[2] ┆ struct[2] ┆ struct[2] │
╞═══════════╪═══════════╪═══════════╡
│ {10,80}   ┆ {100,800} ┆ {80,500}  │
└───────────┴───────────┴───────────┘

I am looking for a function like df.expand_structs(column_name='metric') that gives

shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘

I've tried other functions like unnest and explode, but no luck. Any help appreciated!

5 Answers

5

TL;DR

Performance comparison at the end.

Both @etrotta's method and @DeanMacGregor's adjustment of it perform well on a pl.LazyFrame with small structs (e.g., struct[2]) and N <= 15 columns (not collected). The other methods fail on a LazyFrame.

With bigger structs and/or N > 15 columns, both unpivot options below start to outperform them. The other suggested methods are generally slower.
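To illustrate the lazy point: a LazyFrame has no .transpose() or .pivot(), so only expression-based approaches can stay lazy. A minimal sketch, reusing @etrotta's expressions (with countries and metrics as defined in that answer below):

lf = df.lazy()

# expression-only approaches stay lazy:
out_lazy = lf.select(
    pl.lit(metrics).alias("metrics"),
    *(pl.concat_list(
        pl.col(country).struct.field(metric)
        for metric in metrics
    ).alias(country) for country in countries),
).explode(pl.all())

# options 1 and 2 below cannot: LazyFrame has no .transpose() or
# .pivot(), so they need an eager DataFrame (or a .collect() first)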


Option 1

out = (df.unpivot()                     # wide to long: 'variable'/'value'
       .unnest('value')                 # struct fields to columns
       .select(pl.exclude('variable'))
       .transpose(include_header=True)  # header column holds field names
       .pipe(
           lambda x: x.rename(
               dict(zip(x.columns, ['metric'] + df.columns))
               )
           )
       )

Output:

shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘

Explanation / Intermediates

  • Start with df.unpivot to turn the columns into rows:
shape: (3, 2)
┌──────────┬───────────┐
│ variable ┆ value     │
│ ---      ┆ ---       │
│ str      ┆ struct[2] │
╞══════════╪═══════════╡
│ EU       ┆ {10,80}   │
│ US       ┆ {100,800} │
│ AS       ┆ {80,500}  │
└──────────┴───────────┘
  • So that we can apply df.unnest to the new 'value' column:
shape: (3, 3)
┌──────────┬──────┬─────┐
│ variable ┆ size ┆ GDP │
│ ---      ┆ ---  ┆ --- │
│ str      ┆ i64  ┆ i64 │
╞══════════╪══════╪═════╡
│ EU       ┆ 10   ┆ 80  │
│ US       ┆ 100  ┆ 800 │
│ AS       ┆ 80   ┆ 500 │
└──────────┴──────┴─────┘
  • Drop 'variable' and df.transpose with include_header=True:
shape: (2, 4)
┌────────┬──────────┬──────────┬──────────┐
│ column ┆ column_0 ┆ column_1 ┆ column_2 │
│ ---    ┆ ---      ┆ ---      ┆ ---      │
│ str    ┆ i64      ┆ i64      ┆ i64      │
╞════════╪══════════╪══════════╪══════════╡
│ size   ┆ 10       ┆ 100      ┆ 80       │
│ GDP    ┆ 80       ┆ 800      ┆ 500      │
└────────┴──────────┴──────────┴──────────┘
  • Now we just need to rename the columns, here done via df.pipe + df.rename. Without the chained operation, that can also be:
out.columns = ['metric'] + df.columns

Option 2

out2 = (df.unpivot()
        .unnest('value')
        .unpivot(index='variable', variable_name='metric')
        .pivot(on='variable', index='metric')
        )

Equality check:

out.equals(out2)
# True

Explanation / Intermediates

  • Same start as option 1, but followed by a second df.unpivot to get:
shape: (6, 3)
┌──────────┬────────┬───────┐
│ variable ┆ metric ┆ value │
│ ---      ┆ ---    ┆ ---   │
│ str      ┆ str    ┆ i64   │
╞══════════╪════════╪═══════╡
│ EU       ┆ size   ┆ 10    │
│ US       ┆ size   ┆ 100   │
│ AS       ┆ size   ┆ 80    │
│ EU       ┆ GDP    ┆ 80    │
│ US       ┆ GDP    ┆ 800   │
│ AS       ┆ GDP    ┆ 500   │
└──────────┴────────┴───────┘
  • Followed by df.pivot on 'variable' with 'metric' as the index to get the desired shape.

Performance comparison (gist)

Number of columns: n_range=[2**k for k in range(12)]

Struct sizes: 2, 20, 100 (i.e., struct[2], struct[20], struct[100])

Methods compared:

  • unpivot_unnest_t (option 1), #@ouroboros1
  • unpivot_unnest_t2 (option 1, adj)
  • unpivot_pivot (option 2)
  • concat_list_expl, #@etrotta
  • concat_list_expl_lazy, #lazy
  • concat_list_expl2, #@etrotta, #@DeanMacGregor
  • concat_list_expl2_lazy, #lazy
  • map_batches, #@DeanMacGregor
  • loop, #@sammywemmy

Results: benchmark plots for struct[2], struct[20], and struct[100] (see the linked gist).
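For reference, the benchmark follows roughly this pattern (a minimal sketch assuming perfplot; make_df and the single kernel shown here are illustrative, the gist has the full setup):

import perfplot
import polars as pl

def make_df(n_cols: int) -> pl.DataFrame:
    # 1-row frame with n_cols struct[2] columns, mirroring the OP's data
    return pl.DataFrame(
        {f"c{i}": {"size": 10, "GDP": 80} for i in range(n_cols)}
    )

def unpivot_unnest_t(df: pl.DataFrame) -> pl.DataFrame:
    # option 1 from above
    out = (df.unpivot()
           .unnest("value")
           .select(pl.exclude("variable"))
           .transpose(include_header=True))
    out.columns = ["metric"] + df.columns
    return out

perfplot.show(
    setup=make_df,
    kernels=[unpivot_unnest_t],  # plus the other methods compared
    labels=["unpivot_unnest_t"],
    n_range=[2**k for k in range(12)],
    equality_check=None,  # outputs differ across methods (column order)
)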


3 Comments

Not sure if you can refactor the pivot but you can start with df.unpivot() instead of transpose.
@jqurious: thanks, cleaner indeed. Added another option with .select + .transpose.
Great job @ouroboros1. Am I right to conclude that the test is only width wise (increasing number of columns) and not length wise (increasing number of rows)? Basically a single row with lots of structs and lots of columns?
3

Working with structs typically gets a bit awkward when you have multiple columns with the same fields, so I would first turn them into lists, then explode:

schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']

# one list column per country holding that country's field values,
# plus a list of the field names; then explode everything to rows
df.select(
    pl.lit(metrics).alias("metrics"),
    *(pl.concat_list(
        pl.col(country).struct.field(metric)
        for metric in metrics
    ).alias(country) for country in countries),
).explode(pl.all())

2 Comments

minor tweak, instead of lit do pl.Series("metrics", metrics), and instead of using .explode(pl.all()) on the df, put the explode before the alias at the Expr level (see the sketch after these comments). This way it only does 3 explodes instead of 4, and it can do them in parallel. Looking at a profile of the two shows a big difference, although I'm not sure what units profile uses.
Also very nice about this method is that it works on a pl.LazyFrame. @DeanMacGregor: added performance comparison to my answer, also your variant here.
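The adjustment described in the first comment would look roughly like this (a sketch, not @DeanMacGregor's verbatim code; it reuses schema, countries, and metrics from the answer above):

out = df.select(
    pl.Series("metrics", metrics),  # plain Series instead of pl.lit
    *(
        pl.concat_list(
            pl.col(country).struct.field(metric)
            for metric in metrics
        )
        .explode()  # explode per expression instead of on the whole frame
        .alias(country)
        for country in countries
    ),
)

Because everything stays at the expression level, this variant also runs on a LazyFrame (the concat_list_expl2_lazy entry in the benchmark above).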
1

A variation of the answer from @etrotta, without exploding.

schema = df.collect_schema()
countries = schema.names()
metrics = list(schema[countries[0]].to_schema())

metric = pl.concat(
    pl.repeat(metric, pl.len()).alias("metric")
    for metric in metrics
)

values = [
    pl.concat(
        pl.col(country).struct.field(metric) for metric in metrics
    ).alias(country)
    for country in countries
]

df.select(metric, *values)
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘


0

I think etrotta's method will be more efficient, but here's a way that is syntactically shorter:

df.select(
    pl.Series('metric', (metrics := [x.name for x in df.dtypes[0].fields])),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(metrics).explode())
        .to_series().alias(s.name)
    ))
)

Note the walrus operator in the Series and the subsequent reuse of metrics in concat_list. If you're confident that the fields will be in the same order in each of your structs, then you could forgo the walrus and just use pl.all() inside the concat_list.

Alternatively, if you don't like referring to the df inside its own context, then you could create the metrics column this way, which assumes all the structs' fields are in the same order:

df.select(
    pl.first().map_batches(lambda s: pl.Series(s.struct.fields)).alias('metrics'),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(pl.all()).explode())
        .to_series().alias(s.name)
    ))
)
shape: (2, 4)
┌─────────┬─────┬─────┬─────┐
│ metrics ┆ EU  ┆ US  ┆ AS  │
│ ---     ┆ --- ┆ --- ┆ --- │
│ str     ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╪═════╡
│ size    ┆ 10  ┆ 100 ┆ 80  │
│ GDP     ┆ 80  ┆ 800 ┆ 500 │
└─────────┴─────┴─────┴─────┘


0

Speed-wise (and maybe simplicity-wise), I would suggest using a for loop to create the individual Series and then building a new DataFrame from them. This approach is faster than @etrotta's excellent work:

import polars as pl

# reusing @etrotta's work:
schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']


# build a dictionary of Series
# and subsequently create a new DataFrame
mapping = {}
for country in countries:
    array = []
    for metric in metrics:
        series = df.get_column(country).struct.field(metric)
        array.append(series)
    mapping[country] = pl.concat(array)

# build the 'metrics' labels by repeating each field name;
# if you are not opposed to using another library, numpy.repeat
# would fit in nicely here and should offer good perf as well
array = []
for metric in metrics:
    array.append(pl.repeat(metric,n=len(df),eager=True))
mapping['metrics']=pl.concat(array)
pl.DataFrame(mapping)

shape: (2, 4)
┌─────┬─────┬─────┬─────────┐
│ EU  ┆ US  ┆ AS  ┆ metrics │
│ --- ┆ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64 ┆ str     │
╞═════╪═════╪═════╪═════════╡
│ 10  ┆ 100 ┆ 80  ┆ size    │
│ 80  ┆ 800 ┆ 500 ┆ GDP     │
└─────┴─────┴─────┴─────────┘

Of course, the speed tests are based on your shared data; would it still be performant for a large number of columns (width, not length, now being the controlling factor)?

NB: If the field names could be accessed directly within a context, that would probably offer even more performance, as everything would occur within the Polars framework.

3 Comments

added performance comparison to my answer. @etrotta's answer is nice because it can be done lazily; none of the others can. Otherwise, unpivot gives good performance, if my calculations are correct (gist available for review).
@ouroboros1 great job. Do you mind extending your tests to an increasing number of rows. Your current approach tests multiple structs with multiple columns, but for a single row. Curious to see if there is a change in performance
I would, but the solutions do not scale similarly with n rows > 1. My option 1 goes wide, option 2 breaks; so does Dean's; yours & jqurious' go long (duplicated metrics), so does etrotta's but alternating. Neither seems the obvious desired output (and the OP did not ask, so who's to say). E.g., df = pl.DataFrame({'A': [{'a': 1, 'b': 2},{'a': 3, 'b': 4}]}) with jqurious' method gives: pl.DataFrame({'metric': ['a', 'a', 'b', 'b'], 'A': [1, 3, 2, 4]}). Mine: pl.DataFrame({'metric': ['a', 'b'], 'A': [1, 2], 'column_1': [3, 4]}), which would require a column-label tweak. A.1, A.2? Which one would be "correct"?
