Here is the code to create a pyspark.sql DataFrame
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
df = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]]), columns=['a','b','c'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)
So that sparkdf looks like
 a  b  c
 1  2  3
 4  5  6
 7  8  9
10 11 12
Now I would like to add a numpy array (or even a list) as a new column:
new_col = np.array([20,20,20,20])
But the standard way
sparkdf = sparkdf.withColumn('newcol', new_col)
fails. Probably a udf is the way to go, but I don't know how to create a udf that assigns a different value to each DataFrame row, i.e. one that iterates through new_col. I have looked at other pyspark and pyspark.sql questions but couldn't find a solution. Also, I need to stay within pyspark.sql, so a Scala solution won't work. Thanks!
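For reference, the only udf pattern I know is one that computes the new column from existing columns, roughly like this (plus_one is just a made-up name for illustration):
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
# udf over an existing column: the new value depends on column 'a'
plus_one = udf(lambda x: x + 1, IntegerType())
sparkdf = sparkdf.withColumn('newcol', plus_one(sparkdf['a']))
That works, but it gives me no way to pull the i-th element of new_col for the i-th row, since the udf only ever sees the values of existing columns.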