
I need to append a NumPy array to a PySpark DataFrame as a new column.

The result needs to look like this, with the var38mc column added:

+----+------+-------------+-------+
|  ID|TARGET|        var38|var38mc|
+----+------+-------------+-------+
| 1.0|   0.0|  117310.9790|   True|
| 3.0|   0.0|  39205.17000|  False|
| 4.0|   0.0|  117310.9790|   True|
+----+------+-------------+-------+

First, I calculated a Boolean array that flags which var38 values are approximately equal to 117310.979016494:

import numpy as np

array_var38mc = np.isclose(train3.select("var38").rdd.flatMap(lambda x: x).collect(), 117310.979016494)

The output is a numpy.ndarray object, e.g. [True, False, True].
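
For reference, a minimal sketch of that step using just the three example values from the table above (the values are only illustrative):

import numpy as np

vals = [117310.9790, 39205.17000, 117310.9790]      # the var38 values from the example
array_var38mc = np.isclose(vals, 117310.979016494)  # default rtol=1e-05 treats the first/last as close
print(array_var38mc)                                 # [ True False  True]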

Next, I tried to append this NumPy array, calculated from the same PySpark DataFrame, as a new column:

train4 = train3.withColumn('var38mc',col(df_var38mc))

But I got this error:

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

P.S.: I also tried converting the NumPy array into a Python list, and into another PySpark DataFrame, without success.

1 Answer


Use a UDF instead:

import pyspark.sql.functions as F
from pyspark.sql.types import BooleanType
import numpy as np

# Wrap np.isclose in a UDF so the comparison runs row by row on the column values
func = F.udf(lambda x: bool(np.isclose(x, 117310.979016494)), BooleanType())
train4 = train3.withColumn('var38mc', func('var38'))
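
You can then inspect the new column; the selected column names below are assumed from the example output in the question:

train4.select('ID', 'TARGET', 'var38', 'var38mc').show(3)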