I have a DataFrame in which I need to split columns on commas into separate columns. The problem arises when a column is entirely null: it is dropped from the result. In the example below, I need a DataFrame with the columns "mpg", "wt_0", "wt_1" and "carb_0".

How can I unnest struct columns without dropping empty structs?

library(polars)

mtcars$carb <- NA_real_

dt <- as_polars_df(mtcars)

dt$select(
  pl$col("mpg"),
  wt = pl$col("wt")$cast(pl$String)$str$replace("\\.", ","),
  carb = pl$col("carb")$cast(pl$String)$str$replace("\\.", ",")
)$with_columns(
  pl$col("wt")$str$split(",")$list$to_struct(
    fields = \(x) paste0("wt_", x),
    n_field_strategy = "max_width"
  ),
  pl$col("carb")$str$split(",")$list$to_struct(
    fields = \(x) paste0("carb_", x),
    n_field_strategy = "max_width"
  )
)$unnest()

shape: (32, 3)
┌──────┬──────┬──────┐
│ mpg  ┆ wt_0 ┆ wt_1 │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ 21.0 ┆ 2    ┆ 62   │
│ 21.0 ┆ 2    ┆ 875  │
│ 22.8 ┆ 2    ┆ 32   │
│ 21.4 ┆ 3    ┆ 215  │
│ 18.7 ┆ 3    ┆ 44   │
│ …    ┆ …    ┆ …    │
│ 30.4 ┆ 1    ┆ 513  │
│ 15.8 ┆ 3    ┆ 17   │
│ 19.7 ┆ 2    ┆ 77   │
│ 15.0 ┆ 3    ┆ 57   │
│ 21.4 ┆ 2    ┆ 78   │
└──────┴──────┴──────┘

Python version

df.with_columns(
    pl.col("wt", "carb").cast(pl.String).str.replace(r"[.]", ","),
).with_columns(
    pl.col("wt").str.split(",")
      .list.to_struct("max_width", fields=lambda n: f"wt_{n}"),
    pl.col("carb").str.split(",")
      .list.to_struct("max_width", fields=lambda n: f"carb_{n}")
).unnest("wt", "carb")

If carb is not null, the output is as expected.

df = pl.read_csv(b"""mpg,wt,carb
1,1.2,2.3
2,3.4,4.5""")

# shape: (2, 5)
# ┌─────┬──────┬──────┬────────┬────────┐
# │ mpg ┆ wt_0 ┆ wt_1 ┆ carb_0 ┆ carb_1 │
# │ --- ┆ ---  ┆ ---  ┆ ---    ┆ ---    │
# │ i64 ┆ str  ┆ str  ┆ str    ┆ str    │
# ╞═════╪══════╪══════╪════════╪════════╡
# │ 1   ┆ 1    ┆ 2    ┆ 2      ┆ 3      │
# │ 2   ┆ 3    ┆ 4    ┆ 4      ┆ 5      │
# └─────┴──────┴──────┴────────┴────────┘

If carb is null, there is no carb_0 column in the output.

df = pl.read_csv(b"""mpg,wt,carb
1,1.2,
2,3.4,""")

# shape: (2, 3)
# ┌─────┬──────┬──────┐
# │ mpg ┆ wt_0 ┆ wt_1 │
# │ --- ┆ ---  ┆ ---  │
# │ i64 ┆ str  ┆ str  │
# ╞═════╪══════╪══════╡
# │ 1   ┆ 1    ┆ 2    │
# │ 2   ┆ 3    ┆ 4    │
# └─────┴──────┴──────┘

  • What should the output be? A single carb_0 column? Commented Apr 15 at 6:29
  • I've added a Python version because it's an interesting question that I have not seen before. Most of the eyes are on the python-polars tag, so you have a better chance of a response. If you can clarify the expected output, that would help. Commented Apr 15 at 6:40
  • @jqurious Thanks for improving my question. The output could be mpg, wt_0, wt_1 and carb_0. Commented Apr 15 at 15:40

1 Answer

In Python, what comes to mind is to call .fill_null([None]) right after .str.split(). A null row stays null after the split, so replacing it with a list holding a single null gives the struct one field to unnest.

shape: (2, 3)
┌─────┬─────┬───────────┐
│ mpg ┆ wt  ┆ carb      │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ str ┆ list[str] │
╞═════╪═════╪═══════════╡
│ 1   ┆ 1,2 ┆ [null]    │
│ 2   ┆ 3,4 ┆ [null]    │
└─────┴─────┴───────────┘

I don't know R, but trying to look at the r-polars docs, it seems pl$lit(list(NA)) may be the equivalent?

import polars as pl

df = pl.read_csv(b"""
mpg,wt,carb
1,1.2,
2,3.4,
""".strip())
(
    df.with_columns(
        pl.col("wt", "carb").cast(pl.String).str.replace(r"[.]", ",")
    )
    .with_columns(
        pl.col("wt").str.split(",")
          .fill_null([None])
          .list.to_struct(
              fields = lambda n: f"wt_{n}",
              n_field_strategy = "max_width"
          ),
        pl.col("carb").str.split(",")
          .fill_null([None])
          .list.to_struct(
              fields = lambda n: f"carb_{n}",
              n_field_strategy = "max_width"
          )
    )
    .unnest("wt", "carb")
)
shape: (2, 4)
┌─────┬──────┬──────┬────────┐
│ mpg ┆ wt_0 ┆ wt_1 ┆ carb_0 │
│ --- ┆ ---  ┆ ---  ┆ ---    │
│ i64 ┆ str  ┆ str  ┆ str    │
╞═════╪══════╪══════╪════════╡
│ 1   ┆ 1    ┆ 2    ┆ null   │
│ 2   ┆ 3    ┆ 4    ┆ null   │
└─────┴──────┴──────┴────────┘

1 Comment

"it seems pl$lit(list(NA)) may be the equivalent?" Yes, it works. Thank you.
