
I want to merge two arrays into one array with duplicates removed, in Spark 2.2 with Java.

Input spark dataset below.

 Dataset.show

col1    | col2 
[1,2,3] | [2,3,5]  

Expected output -

 Dataset.show

    col1    | col2    | col3
    [1,2,3] | [2,3,5] | [1,2,3,5]

How can I achieve this in Spark with Java? Thanks.

2 Answers

Use a UDF:

val mergeArrays = udf((a: Seq[String], b: Seq[String]) => (a ++ b).toSet.toSeq)

Then, assuming your input is

val df = Seq((Seq(1,2),Seq(2,3))).toDF("col1", "col2")

you can merge the arrays with

df.withColumn("col3", mergeArrays($"col1", $"col2"))

resulting in

+------+------+---------+
|  col1|  col2|     col3|
+------+------+---------+
|[1, 2]|[2, 3]|[1, 2, 3]|
+------+------+---------+

EDIT: Java version. As expected, it's way uglier, so if you can use Scala, use that instead.

import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF2;
import scala.collection.Seq;
import java.util.*;
import static org.apache.spark.sql.types.DataTypes.*;
import static scala.collection.JavaConverters.*;

// Build a sample dataset with two integer-array columns.
Dataset<Row> data = spark.createDataFrame(
        Collections.singletonList(RowFactory.create(Arrays.asList(1, 2), Arrays.asList(2, 3))),
        createStructType(Arrays.asList(
                createStructField("col1", createArrayType(IntegerType), true),
                createStructField("col2", createArrayType(IntegerType), true))));

// Register a UDF that converts both Scala Seqs to Java collections,
// unions them through a Set to drop duplicates, and converts back.
spark.sqlContext().udf().register("udfMerge", (UDF2<Seq<Integer>, Seq<Integer>, Seq<Integer>>) (s1, s2) -> {
    Set<Integer> s = new HashSet<>();
    s.addAll(asJavaCollectionConverter(s1).asJavaCollection());
    s.addAll(asJavaCollectionConverter(s2).asJavaCollection());
    return collectionAsScalaIterableConverter(s).asScala().toSeq();
}, createArrayType(IntegerType));

// Apply the UDF to produce the merged column.
data.withColumn("col3", functions.callUDF("udfMerge", functions.col("col1"), functions.col("col2"))).show();
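The body of the UDF is just a set union of the two sequences. A minimal plain-Java sketch of that logic, with no Spark involved (the class and method names are illustrative only; note it uses a `LinkedHashSet` to keep first-seen order, whereas the `HashSet` above makes no ordering guarantee):

```java
import java.util.*;

public class MergeArrays {
    // Union of two lists with duplicates removed;
    // LinkedHashSet preserves first-seen order.
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        Set<Integer> s = new LinkedHashSet<>(a);
        s.addAll(b);
        return new ArrayList<>(s);
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 5]
        System.out.println(merge(Arrays.asList(1, 2, 3), Arrays.asList(2, 3, 5)));
    }
}
```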

1 Comment

I have only used Spark with Scala, but I don't see why this wouldn't work in Java. The syntax won't be as concise, and you'll be missing on some implicits, too, but all Scala classes can be referenced from Java

Since Spark 2.4, you can use the array_union function, which merges two arrays and removes duplicates:

import static org.apache.spark.sql.functions.array_union;
import static org.apache.spark.sql.functions.col;

dataframe.withColumn("col3", array_union(col("col1"), col("col2")));
