
I have a Spark DataFrame that looks like the one below. Each row contains one list, from which I want to extract an element. I am pretty green at Spark, so I convert it to a pandas DataFrame and then extract the desired elements with map functions (roughly as in the sketch after the sample frame below). The problem is that the data is huge, so this approach does not scale: the toPandas() call is what takes all the time. Is there a way to access the values inside the list in each row directly in Spark?

Thanks!

+--------------------+
|            sentence|
+--------------------+
|[{document, 0, 23...|
|[{document, 0, 68...|
|[{document, 0, 65...|
|[{document, 0, 23...|
|[{document, 0, 23...|
|[{document, 0, 23...|
+--------------------+
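For reference, this is roughly what I am doing now (a sketch; the lambda stands in for my actual extraction logic):

```python
# Current approach: collect everything to the driver, then use pandas.
# toPandas() materializes the full dataset in driver memory, which is
# the part that does not scale.
pdf = df.toPandas()
pdf["selected_item"] = pdf["sentence"].map(lambda lst: lst[0])
```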

1 Answer


You can, for example, create a new column by picking an element from the list in another column by index:

from pyspark.sql import functions as F

# df is your existing DataFrame with the "sentence" array column
df = df.withColumn("selected_item", F.col("sentence").getItem(0))
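Beyond picking by index, if the array holds structs (as Spark NLP annotation columns typically do), you can also reach into a single field with getField, or flatten the array with explode. Here is a minimal self-contained sketch; the struct fields (annotatorType, begin, end, result) are an assumption based on the truncated output in the question, not the actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data mimicking the question's frame: one array of structs per row.
# The field names below are assumptions, not the asker's actual schema.
df = spark.createDataFrame(
    [
        ([("document", 0, 23, "first sentence")],),
        ([("document", 0, 68, "second sentence")],),
    ],
    "sentence array<struct<annotatorType: string, begin: int, end: int, result: string>>",
)

# Index into the array, then into one field of the selected struct.
extracted = df.withColumn(
    "selected_item", F.col("sentence").getItem(0).getField("result")
)
extracted.show(truncate=False)

# Alternative: explode gives each element of the array its own row,
# after which struct fields are reachable with dot notation.
exploded = df.select(F.explode("sentence").alias("s")).select("s.result")
exploded.show(truncate=False)
```

Both transformations run inside Spark, so nothing is collected to the driver.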