I have a CSV file with two columns:

id, features

The id column is a string, and the features column is a comma-delimited list of feature values for a machine learning algorithm, e.g. "[1,4,5]". I basically just need to call Vectors.parse() on each value to get a vector, but I don't want to convert to an RDD first.

I want to get this into a Spark DataFrame where the features column is an org.apache.spark.mllib.linalg.Vector.

I am reading the file into a DataFrame with the Databricks CSV API, and I'm trying to convert the features column to a Vector.

Does anyone know how to do this in Java?

1 Answer

I found one way to do it with a UDF. Are there any other ways to do this?

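  import java.util.HashMap;
  import org.apache.spark.mllib.linalg.Vector;
  import org.apache.spark.mllib.linalg.VectorUDT;
  import org.apache.spark.mllib.linalg.Vectors;
  import org.apache.spark.sql.DataFrame;
  import org.apache.spark.sql.functions;
  import org.apache.spark.sql.api.java.UDF1;
  import org.apache.spark.sql.types.*;

  // sqlc is an existing SQLContext; args[0] is the path to the CSV file.
  HashMap<String, String> options = new HashMap<String, String>();
  options.put("header", "true");
  String input = args[0];

  // Register a UDF that parses a string such as "[1,4,5]" into an MLlib Vector.
  sqlc.udf().register("toVector", new UDF1<String, Vector>() {
     @Override
     public Vector call(String t1) throws Exception {
        return Vectors.parse(t1);
     }
  }, new VectorUDT());

  // Read both columns as strings first.
  StructField[] fields = {
     new StructField("id", DataTypes.StringType, false, Metadata.empty()),
     new StructField("features", DataTypes.StringType, false, Metadata.empty())
  };
  StructType schema = new StructType(fields);

  DataFrame df = sqlc.read().format("com.databricks.spark.csv").schema(schema).options(options).load(input);

  // Replace the string column with the parsed Vector column.
  df = df.withColumn("features", functions.callUDF("toVector", df.col("features")));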
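
One thing the UDF above does not do is guard against null or malformed feature strings; any row that Vectors.parse cannot handle will fail the job at execution time. To sanity-check the input format without Spark, a plain-Java sketch like the following may help. It is not Spark's implementation: the class DenseVectorText and the method parseDense are hypothetical names introduced here, and it only covers the bracketed dense form used in the question ("[1,4,5]").

```java
import java.util.Arrays;

public class DenseVectorText {
    // Hypothetical helper, not part of Spark: parses "[1,4,5]" into a double[].
    // Throws IllegalArgumentException on null input, missing brackets, or non-numeric values.
    public static double[] parseDense(String text) {
        if (text == null) {
            throw new IllegalArgumentException("features value is null");
        }
        String trimmed = text.trim();
        if (trimmed.length() < 2 || !trimmed.startsWith("[") || !trimmed.endsWith("]")) {
            throw new IllegalArgumentException("expected [v1,v2,...] but got: " + text);
        }
        // Strip the surrounding brackets.
        String body = trimmed.substring(1, trimmed.length() - 1).trim();
        if (body.isEmpty()) {
            return new double[0];
        }
        String[] parts = body.split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // NumberFormatException is a subclass of IllegalArgumentException.
            values[i] = Double.parseDouble(parts[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseDense("[1,4,5]"))); // prints [1.0, 4.0, 5.0]
    }
}
```

If a check like this is useful inside the job, the same validation could run in the UDF before calling Vectors.parse, so that bad rows produce a clearer error message.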