3

How do I select the longest string from a list of strings in polars?

Example and expected output:

import polars as pl

df = pl.DataFrame({
    "values": [
        ["the", "quickest", "brown", "fox"],
        ["jumps", "over", "the", "lazy", "dog"],
        []
    ]
})
┌──────────────────────────────┬────────────────┐
│ values                       ┆ longest_string │
│ ---                          ┆ ---            │
│ list[str]                    ┆ str            │
╞══════════════════════════════╪════════════════╡
│ ["the", "quickest", … "fox"] ┆ quickest       │
│ ["jumps", "over", … "dog"]   ┆ jumps          │
│ []                           ┆ null           │
└──────────────────────────────┴────────────────┘

My use case is to select the longest overlapping match.

Edit: elaborating on the longest overlapping match, this is the output for the example provided by polars:

┌────────────┬───────────┬─────────────────────────────────┐
│ values     ┆ matches   ┆ matches_overlapping             │
│ ---        ┆ ---       ┆ ---                             │
│ str        ┆ list[str] ┆ list[str]                       │
╞════════════╪═══════════╪═════════════════════════════════╡
│ discontent ┆ ["disco"] ┆ ["disco", "onte", "discontent"] │
└────────────┴───────────┴─────────────────────────────────┘

I want a way to select the longest match in matches_overlapping.

  • Can you clarify what you mean by the longest overlapping match? That is completely different to extracting just the longest string from the list. Commented Jun 5 at 18:24
  • Sure thing - I have edited my question accordingly. Commented Jun 5 at 21:47

2 Answers

3

You can do something like:

df.with_columns(
    pl.col('values').list.get(
        pl.col('values')
        .list.eval(pl.element().str.len_chars())
        .list.arg_max()
    )
    .alias('longest_string')
)

This expression:

pl.col('values')
.list.eval(pl.element().str.len_chars())
.list.arg_max()

first applies len_chars to every string in each list with .list.eval, then finds the arg_max of those lengths (the index of the maximum element, so in this case the index of the longest string).

That index is then passed to list.get to retrieve the string at that position in each list. For an empty list the index is null, so the result is null.


0

You can achieve this in Polars using .list.eval() along with .str.len_chars() to compute the length of every string in each list, then .list.arg_max() and .list.get() to pick out the longest one.
