TL;DR
Performance comparison at the end.
Both @etrotta's method and @DeanMacGregor's adjustment perform well on a pl.LazyFrame with small structs (e.g., struct[2]) and N <= 15 columns (not collected). The other methods fail when run lazily.
With bigger structs and/or N > 15 columns, both unpivot options below start to outperform them. The other methods suggested so far are slower in general.
Option 1
out = (df.unpivot()
       .unnest('value')
       .select(pl.exclude('variable'))
       .transpose(include_header=True)
       .pipe(
           lambda x: x.rename(
               dict(zip(x.columns, ['metric'] + df.columns))
           )
       )
       )
Output:
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU ┆ US ┆ AS │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size ┆ 10 ┆ 100 ┆ 80 │
│ GDP ┆ 80 ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
Explanation / Intermediates
- Use df.unpivot to turn each column into a row:
shape: (3, 2)
┌──────────┬───────────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ struct[2] │
╞══════════╪═══════════╡
│ EU ┆ {10,80} │
│ US ┆ {100,800} │
│ AS ┆ {80,500} │
└──────────┴───────────┘
- So that we can apply df.unnest on the new 'value' column:
shape: (3, 3)
┌──────────┬──────┬─────┐
│ variable ┆ size ┆ GDP │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════════╪══════╪═════╡
│ EU ┆ 10 ┆ 80 │
│ US ┆ 100 ┆ 800 │
│ AS ┆ 80 ┆ 500 │
└──────────┴──────┴─────┘
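- Use df.select with pl.exclude to drop 'variable', then df.transpose with include_header=True to get:
shape: (2, 4)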
┌────────┬──────────┬──────────┬──────────┐
│ column ┆ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╪══════════╡
│ size ┆ 10 ┆ 100 ┆ 80 │
│ GDP ┆ 80 ┆ 800 ┆ 500 │
└────────┴──────────┴──────────┴──────────┘
- Now, we just need to rename the columns. Here that is done via df.pipe + df.rename. Without the chained operation, it can also be:
out.columns = ['metric'] + df.columns
Option 2
out2 = (df.unpivot()
        .unnest('value')
        .unpivot(index='variable', variable_name='metric')
        .pivot(on='variable', index='metric')
        )
Equality check:
out.equals(out2)
# True
Explanation / Intermediates
- Same start as option 1, but followed by a second df.unpivot to get:
shape: (6, 3)
┌──────────┬────────┬───────┐
│ variable ┆ metric ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞══════════╪════════╪═══════╡
│ EU ┆ size ┆ 10 │
│ US ┆ size ┆ 100 │
│ AS ┆ size ┆ 80 │
│ EU ┆ GDP ┆ 80 │
│ US ┆ GDP ┆ 800 │
│ AS ┆ GDP ┆ 500 │
└──────────┴────────┴───────┘
- Followed by df.pivot on 'variable' with 'metric' as the index to get the desired shape.
Performance comparison (gist)
Columns (N): n_range=[2**k for k in range(12)], i.e. 1 to 2048
Struct fields: 2, 20, 100
Methods compared:
- unpivot_unnest_t (option 1), #@ouroboros1
- unpivot_unnest_t2 (option 1, adj)
- unpivot_pivot (option 2)
- concat_list_expl, #@etrotta
- concat_list_expl_lazy, #lazy
- concat_list_expl2, #@etrotta, #@DeanMacGregor
- concat_list_expl2_lazy, #lazy
- map_batches, #@DeanMacGregor
- loop, #@sammywemmy
Results:
![struct[2]](https://mapledrawhubb.com/i.sstatic.net/lvbGYf9F.png)
![struct[20]](https://mapledrawhubb.com/i.sstatic.net/rn6efxkZ.png)
![struct[100]](https://mapledrawhubb.com/i.sstatic.net/6H07XniB.png)