Converting CSV values to Vector in Spark Dataframe in Java

Question

I have a CSV file with two columns

id, features

the id column is a string and the features column is a comma delimited list of feature values for a Machine Learning algorithm ie. "[1,4,5]" I basically just need to call Vectors.parse() on the value to get a vector, but I don't want to convert to an RDD first.

I want to get this into a Spark Dataframe where the features column is a org.apache.spark.mllib.linalg.Vector

I am reading this into a dataframe with the databricks csv api and I'm trying to convert the features column to a Vector.

Does anyone know how to do this in Java?

Mike Pone · Accepted Answer · 2018-03-19 17:32:50Z

1

I found one way to do it with a UDF. Are there any other ways to do this?

  HashMap<String, String> options = new HashMap<String, String>();
  options.put("header", "true");
  String input= args[0];

  sqlc.udf().register("toVector", new UDF1<String, Vector>() {
     @Override
     public Vector call(String t1) throws Exception {
        return Vectors.parse(t1);
     }
  }, new VectorUDT());

  StructField[] fields = {new StructField("id",DataTypes.StringType,false, Metadata.empty()) , new StructField("features", DataTypes.StringType, false, Metadata.empty())};
  StructType schema = new StructType(fields);

  DataFrame df = sqlc.read().format("com.databricks.spark.csv").schema(schema).options(options).load(input);

  df = df.withColumn("features", functions.callUDF("toVector", df.col("features")));

edited Mar 19, 2018 at 17:32

answered Mar 18, 2018 at 17:24

Mike Pone

19.4k13 gold badges57 silver badges70 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Converting CSV values to Vector in Spark Dataframe in Java

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related