
I have a basic question about dataframes and adding a column that should contain a nested list. This is basically the problem:

from pyspark.sql import Row

b = [[['url.de'], ['name']], [['url2.de'], ['name2']]]

# sc and sqlContext are provided by the PySpark shell
a = sc.parallelize(b)
a = a.map(lambda p: Row(URL=p[0], name=p[1]))
df = sqlContext.createDataFrame(a)

list1 = [[['a','s', 'o'],['hallo','ti']],[['a','s', 'o'],['hallo','ti']]]
c = [b[0] + [list1[0]],b[1] + [list1[1]]]

#Output looks like this:
[[['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]], 
 [['url2.de'], ['name2'], [['a', 's', 'o'], ['hallo', 'ti']]]]
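For reference, the per-index concatenation above generalizes with zip; this sketch is plain Python and does not need Spark at all:

```python
# Plain-Python sketch: pair each row of b with the matching entry of list1.
b = [[['url.de'], ['name']], [['url2.de'], ['name2']]]
list1 = [[['a', 's', 'o'], ['hallo', 'ti']], [['a', 's', 'o'], ['hallo', 'ti']]]

# Equivalent to c = [b[0] + [list1[0]], b[1] + [list1[1]]], for any length.
c = [row + [extra] for row, extra in zip(b, list1)]
print(c[0])  # [['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]]
```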

To create a new DataFrame from this output, I'm trying to create a new schema:

schema = df.withColumn('NewColumn',array(lit("10"))).schema

I then use it to create the new DataFrame:

df = sqlContext.createDataFrame(c,schema)
df.map(lambda x: x).collect()

#Output
[Row(URL=[u'url.de'], name=[u'name'], NewColumn=[u'[a, s, o]', u'[hallo, ti]']),
 Row(URL=[u'url2.de'], name=[u'name2'], NewColumn=[u'[a, s, o]', u'[hallo, ti]'])]

The problem now is that the nested list was transformed into a list with two unicode entries instead of keeping its original format.

I think this is due to my definition of the new column, "... array(lit("10"))".

What do I have to use in order to keep the original format?

  • df['NewColumn'].astype(str).values to remove unicode from the column values. Commented Jun 27, 2017 at 11:27
  • Doesn't work: "TypeError: unexpected type: <type 'type'>". And I don't want a string as the result, I want the nested list. Commented Jun 27, 2017 at 12:05

1 Answer


You can directly inspect the schema of the dataframe by calling df.schema. You can see that in the given scenario we have the following:

StructType(
  List(
    StructField(URL,ArrayType(StringType,true),true),
    StructField(name,ArrayType(StringType,true),true),
    StructField(NewColumn,ArrayType(StringType,false),false)
  )
)

The NewColumn that you added is an ArrayType column whose entries are all StringType. So anything contained in the array will be converted to a string, even if it is itself an array. If you want nested arrays (two layers), you need to change your schema so that the NewColumn field has type ArrayType(ArrayType(StringType,False),False). You can do this by defining the schema explicitly:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("URL", ArrayType(StringType(),True), True),
    StructField("name", ArrayType(StringType(),True), True),
    StructField("NewColumn", ArrayType(ArrayType(StringType(),False),False), False)])

Or by changing your code so that NewColumn is defined by nesting the array function, array(array()):

df.withColumn('NewColumn',array(array(lit("10")))).schema

1 Comment

Thanks a lot! Thumbs up! Didn't think of this easy way of doing it.
