I have a basic question about Spark DataFrames: how do I add a column that should contain a nested list? This is basically the problem:
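# sc and sqlContext are the ones predefined in the PySpark shell
from pyspark.sql import Row
from pyspark.sql.functions import array, lit
b = [[['url.de'],['name']],[['url2.de'],['name2']]]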
a = sc.parallelize(b)
a = a.map(lambda p: Row(URL=p[0],name=p[1]))
df = sqlContext.createDataFrame(a)
list1 = [[['a','s', 'o'],['hallo','ti']],[['a','s', 'o'],['hallo','ti']]]
c = [b[0] + [list1[0]],b[1] + [list1[1]]]
#Output looks like this:
[[['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]],
[['url2.de'], ['name2'], [['a', 's', 'o'], ['hallo', 'ti']]]]
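For more than two rows, the same concatenation can be written with `zip`. This is only a sketch of the step above; the loop names `row` and `extra` are just illustrative.

```python
b = [[['url.de'], ['name']], [['url2.de'], ['name2']]]
list1 = [[['a', 's', 'o'], ['hallo', 'ti']], [['a', 's', 'o'], ['hallo', 'ti']]]

# Append each nested list from list1 as a third element of the matching row in b.
c = [row + [extra] for row, extra in zip(b, list1)]
```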
To create a new DataFrame from this output, I'm trying to create a new schema:
schema = df.withColumn('NewColumn',array(lit("10"))).schema
I then use it to create the new DataFrame:
df = sqlContext.createDataFrame(c,schema)
df.collect()
#Output
[Row(URL=[u'url.de'], name=[u'name'], NewColumn=[u'[a, s, o]', u'[hallo, ti]']),
Row(URL=[u'url2.de'], name=[u'name2'], NewColumn=[u'[a, s, o]', u'[hallo, ti]'])]
The problem now is that the nested list was transformed into a list with two unicode string entries instead of keeping its original nested format.
I think this is due to my definition of the new column, "... array(lit("10"))", which presumably declares it as a flat array of strings.
What do I have to use in order to keep the original format?