
I need to append a NumPy array to a PySpark DataFrame as a new column.

The result needs to look like this, with the var38mc column added:

+----+------+-------------+-------+
|  ID|TARGET|        var38|var38mc|
+----+------+-------------+-------+
| 1.0|   0.0|  117310.9790|   True|
| 3.0|   0.0|  39205.17000|  False|
| 4.0|   0.0|  117310.9790|   True|
+----+------+-------------+-------+

First, I calculated a Boolean array that flags which var38 values are approximately equal to 117310.979016494:

import numpy as np

array_var38mc = np.isclose(train3.select("var38").rdd.flatMap(lambda x: x).collect(), 117310.979016494)

The output is a numpy.ndarray object, e.g. [True, False, True].
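
For reference, a minimal sketch of that step using just the three example values from the table above (the values are only illustrative):

import numpy as np

vals = [117310.9790, 39205.17000, 117310.9790]      # the var38 values from the example
array_var38mc = np.isclose(vals, 117310.979016494)  # default rtol=1e-05 treats the first/last as close
print(array_var38mc)                                 # [ True False  True]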

Next, I tried to append this NumPy array, calculated from the same PySpark DataFrame, as a new column:

train4 = train3.withColumn('var38mc',col(df_var38mc))

But I got this error:

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

P.S.: I also tried converting the NumPy array into a Python list, and into another PySpark DataFrame, without success.

1 Answer


Use a UDF instead:

import pyspark.sql.functions as F
from pyspark.sql.types import BooleanType
import numpy as np

# Wrap np.isclose in a UDF so the comparison runs row by row on the column values
func = F.udf(lambda x: bool(np.isclose(x, 117310.979016494)), BooleanType())
train4 = train3.withColumn('var38mc', func('var38'))
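
You can then inspect the new column; the selected column names below are assumed from the example output in the question:

train4.select('ID', 'TARGET', 'var38', 'var38mc').show(3)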