
I have a basic question about dataframes and adding a column that should contain a nested list. This is basically the problem:

from pyspark.sql import Row

b = [[['url.de'], ['name']], [['url2.de'], ['name2']]]

# sc and sqlContext are provided by the PySpark shell
a = sc.parallelize(b)
a = a.map(lambda p: Row(URL=p[0], name=p[1]))
df = sqlContext.createDataFrame(a)

list1 = [[['a','s', 'o'],['hallo','ti']],[['a','s', 'o'],['hallo','ti']]]
c = [b[0] + [list1[0]],b[1] + [list1[1]]]

#Output looks like this:
[[['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]], 
 [['url2.de'], ['name2'], [['a', 's', 'o'], ['hallo', 'ti']]]]
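For reference, the per-index concatenation above generalizes with zip; this sketch is plain Python and does not need Spark at all:

```python
# Plain-Python sketch: pair each row of b with the matching entry of list1.
b = [[['url.de'], ['name']], [['url2.de'], ['name2']]]
list1 = [[['a', 's', 'o'], ['hallo', 'ti']], [['a', 's', 'o'], ['hallo', 'ti']]]

# Equivalent to c = [b[0] + [list1[0]], b[1] + [list1[1]]], for any length.
c = [row + [extra] for row, extra in zip(b, list1)]
print(c[0])  # [['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]]
```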

To create a new DataFrame from this output, I'm trying to create a new schema:

schema = df.withColumn('NewColumn',array(lit("10"))).schema

I then use it to create the new DataFrame:

df = sqlContext.createDataFrame(c,schema)
df.map(lambda x: x).collect()

#Output
[Row(URL=[u'url.de'], name=[u'name'], NewColumn=[u'[a, s, o]', u'[hallo, ti]']),
 Row(URL=[u'url2.de'], name=[u'name2'], NewColumn=[u'[a, s, o]', u'[hallo, ti]'])]

The problem now is that the nested list was transformed into a list with two unicode entries instead of keeping its original format.

I think this is due to my definition of the new column, "... array(lit("10"))".

What do I have to use in order to keep the original format?

  • df['NewColumn'].astype(str).values to remove unicode from the column values. Commented Jun 27, 2017 at 11:27
  • Doesn't work: "TypeError: unexpected type: <type 'type'>". And I don't want a string as the result, I want the nested list. Commented Jun 27, 2017 at 12:05

1 Answer


You can directly inspect the schema of the dataframe by calling df.schema. You can see that in the given scenario we have the following:

StructType(
  List(
    StructField(URL,ArrayType(StringType,true),true),
    StructField(name,ArrayType(StringType,true),true),
    StructField(NewColumn,ArrayType(StringType,false),false)
  )
)

The NewColumn that you added is an ArrayType column whose entries are all StringType. So anything contained in the array will be converted to a string, even if it is itself an array. If you want nested arrays (two layers), you need to change your schema so that the NewColumn field has type ArrayType(ArrayType(StringType,False),False). You can do this by defining the schema explicitly:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("URL", ArrayType(StringType(),True), True),
    StructField("name", ArrayType(StringType(),True), True),
    StructField("NewColumn", ArrayType(ArrayType(StringType(),False),False), False)])

Or by changing your code so that NewColumn is defined by nesting the array function, array(array()):

df.withColumn('NewColumn',array(array(lit("10")))).schema

1 Comment

Thanks a lot! Thumbs up! Didn't think of this easy way of doing it.
