Transform string column to vector column Spark DataFrames

Question

I have a Spark dataframe that looks as follows:

+-----------+-------------------+
|     ID    |     features      |
+-----------+-------------------+
|   18156431|(5,[0,1,4],[1,1,1])|
|   20260831|(5,[0,4,5],[2,1,1])|
|   91859831|(5,[0,1],[1,3])    |
|  206186631|(5,[3,4,5],[1,5])  |
|  223134831|(5,[2,3,5],[1,1,1])|
+-----------+-------------------+

In this dataframe the features column is a sparse vector. In my scripts I have to save this DF as a file on disk. When I do this, the features column is saved as a text column, for example "(5,[0,1,4],[1,1,1])". When the file is imported again in Spark, the column stays a string, as you would expect. How can I convert the column back to (sparse) vector format?

Which version of Spark? Which vector class do you want to get (ML / MLlib)? How do you read this data?
– zero323, Commented Aug 1, 2016 at 11:16
Spark version = 1.6.2. Preferably an ML vector (but you can explain for both). I use the following code to read the data: DF = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true', delimiter=delimiter).load('file://'+path).drop('')
– Stijn, Commented Aug 1, 2016 at 12:35
It can also be an MLlib vector or any other type of (sparse) vector :-)
– Stijn, Commented Aug 1, 2016 at 12:41

zero323 · Accepted Answer · 2019-01-12 15:48:48Z

4

This is not particularly efficient because of the UDF overhead (it would be a better idea to use a format that preserves types), but you can do something like this:

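from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# `sc` is the SparkContext, e.g. the one provided by the PySpark shell
df = sc.parallelize([
    (18156431, "(5,[0,1,4],[1,1,1])")
]).toDF(["id", "features"])

# Parse each string into an MLlib vector; VectorUDT is the UDF's return type
parse = udf(lambda s: Vectors.parse(s), VectorUDT())
df.select(parse("features"))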

Please note that this doesn't port directly to Spark 2.0.0+ and ML vectors. Since ML vectors don't provide a parse method, you'd have to parse to an MLlib vector and convert it with asML:

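from pyspark.ml.linalg import VectorUDT  # the UDF now returns an ML vector, so use the ML VectorUDT instead of the MLlib one imported above
parse = udf(lambda s: Vectors.parse(s).asML(), VectorUDT())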
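
As an alternative sketch for Spark 2.0+ that skips the MLlib round trip, the string could be split by hand and passed to pyspark.ml.linalg.Vectors.sparse. The helper names parse_sparse and make_parse_udf are made up for illustration. The code assumes every string has the sparse form (size,[indices],[values]) and does not handle dense vector strings:

```python
import ast


def parse_sparse(s):
    """Split a '(size,[indices],[values])' string into its three parts."""
    # The text happens to be a valid Python literal: a tuple of an int and two lists.
    size, indices, values = ast.literal_eval(s.strip())
    return int(size), [int(i) for i in indices], [float(v) for v in values]


def make_parse_udf():
    """Build a UDF that returns ML sparse vectors (assumes Spark 2.0+)."""
    # Imported here so that parse_sparse can be used and checked without Spark.
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import udf

    return udf(lambda s: Vectors.sparse(*parse_sparse(s)), VectorUDT())


# parse_sparse("(5,[0,1,4],[1,1,1])") -> (5, [0, 1, 4], [1.0, 1.0, 1.0])
```

It would be applied in the same way as the UDF above, for example df.withColumn("features", make_parse_udf()("features")). The UDF overhead still applies.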

edited Jan 12, 2019 at 15:48

answered Aug 1, 2016 at 12:49

zero323

1 Comment

Stijn Over a year ago

Could you please provide me with an example code of asML in python and Spark 2.0.2? Should I put the asML in a udf?
