
I have a Spark DataFrame that looks like the one below. Each row contains one list, from which I want to extract an element. I am pretty green at Spark, so I convert it to a pandas DataFrame and then extract the desired elements with map functions (roughly as in the sketch after the sample frame below). The problem is that the data is huge, so this approach does not scale: the toPandas() call is what takes all the time. Is there a way to access the values inside the list in each row directly in Spark?

Thanks!

+--------------------+
|            sentence|
+--------------------+
|[{document, 0, 23...|
|[{document, 0, 68...|
|[{document, 0, 65...|
|[{document, 0, 23...|
|[{document, 0, 23...|
|[{document, 0, 23...|
+--------------------+
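For reference, this is roughly what I am doing now (a sketch; the lambda stands in for my actual extraction logic):

```python
# Current approach: collect everything to the driver, then use pandas.
# toPandas() materializes the full dataset in driver memory, which is
# the part that does not scale.
pdf = df.toPandas()
pdf["selected_item"] = pdf["sentence"].map(lambda lst: lst[0])
```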

1 Answer


You can, for example, create a new column by picking an element from the list in another column by index:

from pyspark.sql import functions as F

# df is your existing DataFrame with the "sentence" array column
df = df.withColumn("selected_item", F.col("sentence").getItem(0))
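Beyond picking by index, if the array holds structs (as Spark NLP annotation columns typically do), you can also reach into a single field with getField, or flatten the array with explode. Here is a minimal self-contained sketch; the struct fields (annotatorType, begin, end, result) are an assumption based on the truncated output in the question, not the actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data mimicking the question's frame: one array of structs per row.
# The field names below are assumptions, not the asker's actual schema.
df = spark.createDataFrame(
    [
        ([("document", 0, 23, "first sentence")],),
        ([("document", 0, 68, "second sentence")],),
    ],
    "sentence array<struct<annotatorType: string, begin: int, end: int, result: string>>",
)

# Index into the array, then into one field of the selected struct.
extracted = df.withColumn(
    "selected_item", F.col("sentence").getItem(0).getField("result")
)
extracted.show(truncate=False)

# Alternative: explode gives each element of the array its own row,
# after which struct fields are reachable with dot notation.
exploded = df.select(F.explode("sentence").alias("s")).select("s.result")
exploded.show(truncate=False)
```

Both transformations run inside Spark, so nothing is collected to the driver.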