
I want to merge two arrays into one array with duplicates removed, in Spark 2.2 with Java.

Input spark dataset below.

 Dataset.show

col1    | col2 
[1,2,3] | [2,3,5]  

Expected output -

 Dataset.show

    col1    | col2    | col3
    [1,2,3] | [2,3,5] | [1,2,3,5]

How can I achieve this in Spark with Java? Thanks.

2 Answers

Use a UDF:

val mergeArrays = udf((a: Seq[String], b: Seq[String]) => (a ++ b).toSet.toSeq)

Then, assuming your input is

val df = Seq((Seq(1,2),Seq(2,3))).toDF("col1", "col2")

you can merge the arrays with

df.withColumn("col3", mergeArrays($"col1", $"col2"))

resulting in

+------+------+---------+
|  col1|  col2|     col3|
+------+------+---------+
|[1, 2]|[2, 3]|[1, 2, 3]|
+------+------+---------+

EDIT: Java version. As expected, it's way uglier, so if you can use Scala, use that instead.

import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF2;
import scala.collection.Seq;
import java.util.*;
import static org.apache.spark.sql.types.DataTypes.*;
import static scala.collection.JavaConverters.*;

// Build a sample dataset with two integer-array columns.
Dataset<Row> data = spark.createDataFrame(
        Collections.singletonList(RowFactory.create(Arrays.asList(1, 2), Arrays.asList(2, 3))),
        createStructType(Arrays.asList(
                createStructField("col1", createArrayType(IntegerType), true),
                createStructField("col2", createArrayType(IntegerType), true))));

// Register a UDF that converts both Scala Seqs to Java collections,
// unions them through a Set to drop duplicates, and converts back.
spark.sqlContext().udf().register("udfMerge", (UDF2<Seq<Integer>, Seq<Integer>, Seq<Integer>>) (s1, s2) -> {
    Set<Integer> s = new HashSet<>();
    s.addAll(asJavaCollectionConverter(s1).asJavaCollection());
    s.addAll(asJavaCollectionConverter(s2).asJavaCollection());
    return collectionAsScalaIterableConverter(s).asScala().toSeq();
}, createArrayType(IntegerType));

// Apply the UDF to produce the merged column.
data.withColumn("col3", functions.callUDF("udfMerge", functions.col("col1"), functions.col("col2"))).show();
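The body of the UDF is just a set union of the two sequences. A minimal plain-Java sketch of that logic, with no Spark involved (the class and method names are illustrative only; note it uses a `LinkedHashSet` to keep first-seen order, whereas the `HashSet` above makes no ordering guarantee):

```java
import java.util.*;

public class MergeArrays {
    // Union of two lists with duplicates removed;
    // LinkedHashSet preserves first-seen order.
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        Set<Integer> s = new LinkedHashSet<>(a);
        s.addAll(b);
        return new ArrayList<>(s);
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 5]
        System.out.println(merge(Arrays.asList(1, 2, 3), Arrays.asList(2, 3, 5)));
    }
}
```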

1 Comment

I have only used Spark with Scala, but I don't see why this wouldn't work in Java. The syntax won't be as concise, and you'll be missing on some implicits, too, but all Scala classes can be referenced from Java

Since Spark 2.4, you can use the array_union function, which merges two arrays and removes duplicates:

import static org.apache.spark.sql.functions.array_union;
import static org.apache.spark.sql.functions.col;

dataframe.withColumn("col3", array_union(col("col1"), col("col2")));
