I need to append a NumPy array as a new column to a PySpark DataFrame.
The result should look like this, with the added var38mc column:
+----+------+------------+-------+
|  ID|TARGET|       var38|var38mc|
+----+------+------------+-------+
| 1.0|   0.0| 117310.9790|   True|
| 3.0|   0.0| 39205.17000|  False|
| 4.0|   0.0| 117310.9790|   True|
+----+------+------------+-------+
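For context, you can think of train3 as something like this (the data below is illustrative, not my real dataset):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative stand-in for my real train3 DataFrame
    train3 = spark.createDataFrame(
        [(1.0, 0.0, 117310.9790), (3.0, 0.0, 39205.17), (4.0, 0.0, 117310.9790)],
        ["ID", "TARGET", "var38"],
    )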
First, I computed a boolean array flagging the rows where var38 is approximately equal to 117310.979016494:
array_var38mc = np.isclose(train3.select("var38").rdd.flatMap(lambda x: x).collect(), 117310.979016494)
The output is a numpy.ndarray, something like [True, False, True].
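Continuing from the illustrative setup above, that step with its imports is:

    import numpy as np

    # Collect var38 to the driver as a flat list, then compare element-wise
    array_var38mc = np.isclose(
        train3.select("var38").rdd.flatMap(lambda x: x).collect(),
        117310.979016494,
    )

    print(type(array_var38mc))  # <class 'numpy.ndarray'>
    print(array_var38mc)        # [ True False  True] for the sample rows above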
Next, I tried to append this NumPy array, computed from the same PySpark DataFrame, as a new column:
train4 = train3.withColumn('var38mc',col(df_var38mc))
But I got this error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
P.S.: I also tried converting the NumPy array to a plain Python list and wrapping it in another PySpark DataFrame, without success.
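In case it helps, those other attempts were along these lines (reconstructed and simplified, so the details may differ slightly from my real code):

    from pyspark.sql.functions import col

    # Attempt: go through a plain Python list instead of the ndarray
    list_var38mc = array_var38mc.tolist()

    # Attempt: wrap the flags in their own single-column DataFrame
    df_var38mc = spark.createDataFrame([(bool(v),) for v in array_var38mc], ['var38mc'])

    # Passing either of these to withColumn() via col() fails, e.g.:
    train4 = train3.withColumn('var38mc', col(df_var38mc))  # same AttributeError as above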