1

I know PySpark DataFrames are immutable, so I'd like to create a new column resulting from a transformation applied to an existing column of the PySpark DataFrame. My data is too large to use collect().

The column in question is a list of lists of unique ints (no repeats of an int in a given list), e.g.:

[1]
[1,2]
[1,2,3]
[2,3]

The above is a toy example, as my actual DataFrame has lists with a max length of 52 unique ints. I would like to generate a column that iterates through the list of lists of ints and removes one element for each loop. The element to be removed would be one from the set of unique elements across all lists, which in this case is [1,2,3].

So for the first iteration:

Remove element 1, such that the results are:

[]
[2]
[2,3]
[2,3]

For the second iteration:

Remove element 2, such that the results are:

[1]
[1]
[1,3]
[3]

etc. and repeat above with element 3.

For each iteration, I would like to append the results to the original PySpark DataFrame to run some queries against, using this "filtered" column as a row filter for the original DataFrame.

My question is, how do I convert a column of a PySpark DataFrame to a list? My data set is large, so df.select('columnofintlists').collect() results in memory issues (e.g.: Kryo serialization failed: Buffer overflow. Available: 0, required: 1448662. To avoid this, increase spark.kryoserializer.buffer.max value.).

2 Answers 2

1

Here is an example from pyspark docs:

>>>from pyspark.sql.functions import array_remove
>>>from pyspark.sql import SparkSession, SQLContext

>>>sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
>>>spark = SparkSession(sc)

>>>df = spark.createDataFrame([([1, 2, 3, 1, 1],), ([],)], ['data'])

>>>df.select(array_remove(df.data, 1)).collect()
[Row(array_remove(data, 1)=[2, 3]), Row(array_remove(data, 1)=[])]

Reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=w

Sign up to request clarification or add additional context in comments.

Comments

0

df.toLocalIterator() will return an iterator for loop

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.