
I have a column of lists in a Spark DataFrame.

+-----------------+
|features         |
+-----------------+
|[0,45,63,0,0,0,0]|
|[0,0,0,85,0,69,0]|
|[0,89,56,0,0,0,0]|
+-----------------+

How do I convert that to a Spark DataFrame where each element in the list is a column? We can assume that the lists will all be the same size.

For example:

+--------------------+
|c1|c2|c3|c4|c5|c6|c7|
+--------------------+
|0 |45|63|0 |0 |0 |0 |
|0 |0 |0 |85|0 |69|0 |
|0 |89|56|0 |0 |0 |0 |
+--------------------+
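For reference, a minimal sketch that reproduces this sample data (the single array-of-integers column named features is an assumption, since the question does not show its schema):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed schema: a single array<int> column named "features"
df = spark.createDataFrame(
    [([0, 45, 63, 0, 0, 0, 0],),
     ([0, 0, 0, 85, 0, 69, 0],),
     ([0, 89, 56, 0, 0, 0, 0],)],
    ['features'])
df.show(truncate=False)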
  • Perhaps something like this? Commented Dec 8, 2017 at 10:09
  • What's the datatype of the features column? Can you post your schema, please? Commented Dec 8, 2017 at 11:55

4 Answers


What you describe is actually the inverse of the VectorAssembler operation.

You can do it by converting to an intermediate RDD, as follows:

spark.version
# u'2.2.0'

# your data:
df.show(truncate=False)
# +-----------------+ 
# |        features | 
# +-----------------+
# |[0,45,63,0,0,0,0]|
# |[0,0,0,85,0,69,0]|
# |[0,89,56,0,0,0,0]|
# +-----------------+ 

dimensionality = 7
# Convert each row's features into a list of floats, then map back to a DataFrame
out = df.rdd.map(lambda x: [float(x[0][i]) for i in range(dimensionality)]) \
            .toDF(schema=['c' + str(i + 1) for i in range(dimensionality)])
out.show()
# +---+----+----+----+---+----+---+ 
# | c1|  c2|  c3|  c4| c5|  c6| c7|
# +---+----+----+----+---+----+---+ 
# |0.0|45.0|63.0| 0.0|0.0| 0.0|0.0|
# |0.0| 0.0| 0.0|85.0|0.0|69.0|0.0| 
# |0.0|89.0|56.0| 0.0|0.0| 0.0|0.0| 
# +---+----+----+----+---+----+---+
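If you'd rather not hard-code the dimensionality, it can be inferred from the first row (a minimal sketch, relying on the question's guarantee that all lists have the same size):

# Infer the number of elements from the first row (assumes equal-length lists)
dimensionality = len(df.first()[0])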



You can use getItem:

df.withColumn("c1", df["features"].getItem(0))\
  .withColumn("c2", df["features"].getItem(1))\
  .withColumn("c3", df["features"].getItem(2))\
  .withColumn("c4", df["features"].getItem(3))\
  .withColumn("c5", df["features"].getItem(4))\
  .withColumn("c6", df["features"].getItem(5))\
  .withColumn("c7", df["features"].getItem(6))\
  .drop('features').show()

+--------------------+
|c1|c2|c3|c4|c5|c6|c7|
+--------------------+
|0 |45|63|0 |0 |0 |0 |
|0 |0 |0 |85|0 |69|0 |
|0 |89|56|0 |0 |0 |0 |
+--------------------+
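The same result can be written more compactly with a single select and a list comprehension instead of the withColumn chain (a sketch, assuming the seven-element arrays from the question):

df.select(*[df["features"].getItem(i).alias("c" + str(i + 1)) for i in range(7)]).show()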



Here's an alternative without converting to an RDD:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

# If the lists vary in size, take the largest one.
# (F.size works on array columns, not on the vector column a VectorAssembler produces.)
stop = df.select(F.max(F.size('features')).alias('size')).first().size

# Convert the vector column to a plain array so its elements can be indexed
udf1 = F.udf(lambda x: x.toArray().tolist(), ArrayType(FloatType()))
df = df.withColumn('features1', udf1('features'))

df.select(*[df.features1[i].alias('col_{}'.format(i)) for i in range(1, stop)]).show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
|   45|   63|    0|    0|    0|    0|
|    0|    0|   85|    0|   69|    0|
+-----+-----+-----+-----+-----+-----+
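As a side note, on Spark 3.0 and later the hand-rolled UDF above can be replaced with the built-in pyspark.ml.functions.vector_to_array (a sketch, assuming features is an ML vector column; the answers here target Spark 2.x, so treat this as an option for newer clusters):

from pyspark.ml.functions import vector_to_array  # available from Spark 3.0

# Converts a vector column to an array column without a Python udf
df = df.withColumn('features1', vector_to_array('features'))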

Comments

  • The question specifies a column of 'list'. Why use toArray() here?
  • If it's a column of lists, that's fine - we don't need a udf at all. But the column being named "features" is what throws me off.
  • @desertnaut I agree
  • @desertnaut I agree too.
  • Good. As a bonus, I added code highlighting to your posts... ;) @mayankagrawal

@desertnaut's answer can also be accomplished with the DataFrame API and a udf.

import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

dimensionality = 7
column_names = ['c' + str(i + 1) for i in range(dimensionality)]
# Bind i as a default argument so each udf captures its own index
splits = [F.udf(lambda val, i=i: val[i], FloatType()) for i in range(dimensionality)]
df = df.select(*[s('features').alias(j) for s, j in zip(splits, column_names)])

Comments

  • Did this work on a column of vectors or a column of array types?
  • @Suresh nice catch - it won't work on a column of vectors (tested)
  • Exactly, if we use an array type, we can index it directly.
  • @mayankagrawal does not
  • @mayankagrawal the Vector class has a toArray() method. Only DenseVector and SparseVector have values. Correct me if I am wrong. Please check this: spark.apache.org/docs/2.2.0/api/python/…
