
I have a column of lists in a Spark DataFrame.

+-----------------+
|features         |
+-----------------+
|[0,45,63,0,0,0,0]|
|[0,0,0,85,0,69,0]|
|[0,89,56,0,0,0,0]|
+-----------------+

How do I convert that to a Spark DataFrame where each element in the list is a column? We can assume that the lists will all be the same size.

For example:

+--------------------+
|c1|c2|c3|c4|c5|c6|c7|
+--------------------+
|0 |45|63|0 |0 |0 |0 |
|0 |0 |0 |85|0 |69|0 |
|0 |89|56|0 |0 |0 |0 |
+--------------------+
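For reference, a minimal sketch that reproduces this sample data (the single array-of-integers column named features is an assumption, since the question does not show its schema):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed schema: a single array<int> column named "features"
df = spark.createDataFrame(
    [([0, 45, 63, 0, 0, 0, 0],),
     ([0, 0, 0, 85, 0, 69, 0],),
     ([0, 89, 56, 0, 0, 0, 0],)],
    ['features'])
df.show(truncate=False)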
  • Perhaps something like this? Commented Dec 8, 2017 at 10:09
  • What's the datatype of the features column? Can you post your schema, please? Commented Dec 8, 2017 at 11:55

4 Answers


What you describe is actually the inverse of the VectorAssembler operation.

You can do it by converting to an intermediate RDD, as follows:

spark.version
# u'2.2.0'

# your data:
df.show(truncate=False)
# +-----------------+ 
# |        features | 
# +-----------------+
# |[0,45,63,0,0,0,0]|
# |[0,0,0,85,0,69,0]|
# |[0,89,56,0,0,0,0]|
# +-----------------+ 

dimensionality = 7
# Convert each row's features into a list of floats, then map back to a DataFrame
out = df.rdd.map(lambda x: [float(x[0][i]) for i in range(dimensionality)]) \
            .toDF(schema=['c' + str(i + 1) for i in range(dimensionality)])
out.show()
# +---+----+----+----+---+----+---+ 
# | c1|  c2|  c3|  c4| c5|  c6| c7|
# +---+----+----+----+---+----+---+ 
# |0.0|45.0|63.0| 0.0|0.0| 0.0|0.0|
# |0.0| 0.0| 0.0|85.0|0.0|69.0|0.0| 
# |0.0|89.0|56.0| 0.0|0.0| 0.0|0.0| 
# +---+----+----+----+---+----+---+
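If you'd rather not hard-code the dimensionality, it can be inferred from the first row (a minimal sketch, relying on the question's guarantee that all lists have the same size):

# Infer the number of elements from the first row (assumes equal-length lists)
dimensionality = len(df.first()[0])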



You can use getItem:

df.withColumn("c1", df["features"].getItem(0))\
  .withColumn("c2", df["features"].getItem(1))\
  .withColumn("c3", df["features"].getItem(2))\
  .withColumn("c4", df["features"].getItem(3))\
  .withColumn("c5", df["features"].getItem(4))\
  .withColumn("c6", df["features"].getItem(5))\
  .withColumn("c7", df["features"].getItem(6))\
  .drop('features').show()

+--------------------+
|c1|c2|c3|c4|c5|c6|c7|
+--------------------+
|0 |45|63|0 |0 |0 |0 |
|0 |0 |0 |85|0 |69|0 |
|0 |89|56|0 |0 |0 |0 |
+--------------------+
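The same result can be written more compactly with a single select and a list comprehension instead of the withColumn chain (a sketch, assuming the seven-element arrays from the question):

df.select(*[df["features"].getItem(i).alias("c" + str(i + 1)) for i in range(7)]).show()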



Here's an alternative without converting to an RDD:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

# If the lists vary in size, take the largest one.
# (F.size works on array columns, not on the vector column a VectorAssembler produces.)
stop = df.select(F.max(F.size('features')).alias('size')).first().size

# Convert the vector column to a plain array so its elements can be indexed
udf1 = F.udf(lambda x: x.toArray().tolist(), ArrayType(FloatType()))
df = df.withColumn('features1', udf1('features'))

df.select(*[df.features1[i].alias('col_{}'.format(i)) for i in range(1, stop)]).show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
|   45|   63|    0|    0|    0|    0|
|    0|    0|   85|    0|   69|    0|
+-----+-----+-----+-----+-----+-----+
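As a side note, on Spark 3.0 and later the hand-rolled UDF above can be replaced with the built-in pyspark.ml.functions.vector_to_array (a sketch, assuming features is an ML vector column; the answers here target Spark 2.x, so treat this as an option for newer clusters):

from pyspark.ml.functions import vector_to_array  # available from Spark 3.0

# Converts a vector column to an array column without a Python udf
df = df.withColumn('features1', vector_to_array('features'))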

Comments

  • The question specifies a column of 'list'. Why use toArray() here?
  • If it's a column of lists, that's fine - we don't need a udf at all. But the column being named "features" is what throws me off.
  • @desertnaut I agree
  • @desertnaut I agree too.
  • Good. As a bonus, I added code highlighting to your posts... ;) @mayankagrawal

@desertnaut's answer can also be accomplished with the DataFrame API and a udf.

import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

dimensionality = 7
column_names = ['c' + str(i + 1) for i in range(dimensionality)]
# Bind i as a default argument so each udf captures its own index
splits = [F.udf(lambda val, i=i: val[i], FloatType()) for i in range(dimensionality)]
df = df.select(*[s('features').alias(j) for s, j in zip(splits, column_names)])

Comments

  • Did this work on a column of vectors or a column of array types?
  • @Suresh nice catch - it won't work on a column of vectors (tested)
  • Exactly, if we use an array type, we can index it directly.
  • @mayankagrawal does not
  • @mayankagrawal the Vector class has a toArray() method. Only DenseVector and SparseVector have values. Correct me if I am wrong. Please check this: spark.apache.org/docs/2.2.0/api/python/…
