I have a Spark dataframe that looks as follows:

+-----------+-------------------+
|     ID    |     features      |
+-----------+-------------------+
|   18156431|(5,[0,1,4],[1,1,1])|
|   20260831|(5,[0,4,5],[2,1,1])|   
|   91859831|(5,[0,1],[1,3])    |
|  206186631|(5,[3,4,5],[1,5])  |
|  223134831|(5,[2,3,5],[1,1,1])|
+-----------+-------------------+

In this dataframe the features column is a sparse vector. In my scripts I have to save this DF as a file on disk. When I do, the features column is written as a text column, for example "(5,[0,1,4],[1,1,1])". When I import the file into Spark again, the column stays a string, as you would expect. How can I convert the column back to (sparse) vector format?

  • Which version of Spark? Which vector class do you want to get (ML / MLlib)? How do you read this data? Commented Aug 1, 2016 at 11:16
  • Spark version = 1.6.2. Preferably an ML vector (but you can explain for both). I use the following code to read the data: DF = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true', delimiter=delimiter).load('file://'+path).drop('') Commented Aug 1, 2016 at 12:35
  • There is no ML Vector in 1.6, so it makes things simple :) Commented Aug 1, 2016 at 12:36
  • It can also be an MLlib vector or any other type of (sparse) vector :-) Commented Aug 1, 2016 at 12:41

1 Answer

This is not particularly efficient because of the UDF overhead (it would be better to use a format that preserves types), but you can do something like this:

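from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# Sample row with the vector stored as its string representation
df = sc.parallelize([
    (18156431, "(5,[0,1,4],[1,1,1])")
]).toDF(["id", "features"])

# Vectors.parse rebuilds an MLlib vector from the string; VectorUDT is the UDF's return type
parse = udf(lambda s: Vectors.parse(s), VectorUDT())
df.select(parse("features"))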

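Please note that this doesn't port directly to Spark 2.0.0+ and ML vectors. Since ML vectors don't provide a parse method, you'd have to parse to an MLlib vector and convert it with asML. The UDF's return type then has to be the ML VectorUDT rather than the MLlib one:

from pyspark.ml.linalg import VectorUDT  # ML VectorUDT replaces the MLlib one imported above
parse = udf(lambda s: Vectors.parse(s).asML(), VectorUDT())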
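
For readers unfamiliar with the text form, here is a minimal plain-Python sketch of how a string like "(5,[0,1,4],[1,1,1])" breaks down into size, indices and values. It is not Spark's parser and does not replace Vectors.parse; the helper name split_sparse_text and its validation rules are assumptions made purely for illustration.

```python
import re

# Hypothetical helper for illustration only; this is NOT Spark's parser.
# Expected layout: (size,[i0,i1,...],[v0,v1,...])
_SPARSE_RE = re.compile(r"^\((\d+),\[([^\]]*)\],\[([^\]]*)\]\)$")


def split_sparse_text(s):
    """Split a sparse-vector string into (size, indices, values)."""
    m = _SPARSE_RE.match(s.replace(" ", ""))
    if m is None:
        raise ValueError("not a sparse vector string: %r" % s)
    size = int(m.group(1))
    indices = [int(x) for x in m.group(2).split(",") if x]
    values = [float(x) for x in m.group(3).split(",") if x]
    # Assumed sanity checks: one value per index, indices within [0, size)
    if len(indices) != len(values):
        raise ValueError("indices and values differ in length: %r" % s)
    if any(i < 0 or i >= size for i in indices):
        raise ValueError("index out of range for size %d: %r" % (size, s))
    return size, indices, values


size, indices, values = split_sparse_text("(5,[0,1,4],[1,1,1])")
print(size, indices, values)
```

A strict check like this one would reject the question's sample row for ID 206186631 as printed, since it lists three indices but only two values; whether that is a typo in the post is not clear from the source.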

1 Comment

Could you please provide example code for asML in Python with Spark 2.0.2? Should I put the asML call in a UDF?
