How can I store a numpy array as a new column in PySpark DataFrame?

Question

I have a NumPy array returned by np.select and I want to store it as a new column in a PySpark DataFrame. How can I do that?

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': ['abc', 'cde', 'edf']})
df_data = spark.createDataFrame(pdf, schema='a string, b string')

I have a few conditions and choices that I pass to np.select, like this:

np.select(conditions, choices, default='Other')

This returns the following ndarray:

[['val1'], ['val2'], ['val3']]

Now I want to save this ndarray as a new column in df_data.

Could you provide some example code, i.e. the PySpark code that creates your DataFrame and the Python code that creates the NumPy array?
– Cleared, Commented May 25, 2022 at 8:37

ZygD · Accepted Answer · 2022-05-25 11:01:52Z

2

You could first convert the ndarray to a Python list, then wrap each element in F.lit and place it at the matching position in a Spark array column.

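from pyspark.sql import functions as F

ndarray = np.select(conditions, choices, default='Other')
nd_list = ndarray.tolist()  # nested Python list, e.g. [['val1'], ['val2'], ['val3']]
# Build one inner Spark array per element, then wrap them all in an outer array
df_data = df_data.withColumn('ndarray', F.array([F.array(F.lit(e[0])) for e in nd_list]))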

This creates an array-of-arrays column, which is probably the closest Spark equivalent of your list of lists. Note that the column is built from literals, so every row holds the same full nested array.
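The conversion step can be checked on its own, without Spark. The sketch below is only an illustration: the question does not show its conditions and choices, so the input array `a` and the comparisons are hypothetical, chosen to reproduce the (3, 1)-shaped output from the question.

```python
import numpy as np

# Hypothetical input: a column vector, so np.select returns shape (3, 1)
a = np.array([[1], [2], [3]])

# Hypothetical conditions and choices (the question does not show its own)
conditions = [a == 1, a == 2, a == 3]
choices = ['val1', 'val2', 'val3']

ndarray = np.select(conditions, choices, default='Other')

# tolist() keeps the nesting: a list of one-element lists
nd_list = ndarray.tolist()

# ravel() flattens first, giving one plain string per original row
flat_list = ndarray.ravel().tolist()

print(ndarray.shape)
print(nd_list)
print(flat_list)
```

The `e[0]` in the answer's list comprehension is what unwraps each one-element inner list of `nd_list`. If you want a flat list of values instead, `ravel()` removes the extra dimension before the conversion.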

