I have the following Dataset:
+-------------------+--------------------+
| date| products|
+-------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|
|2017-09-21 22:00:00|[361, 361, 361, 3...|
|2017-09-28 22:00:00|[360, 361, 361, 3...|
where the products column is an array of strings with possibly duplicated items.
I would like to remove these duplicates (within each row).
What I did is basically write a UDF like this:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf
val removeDuplicates: WrappedArray[String] => WrappedArray[String] = _.distinct
val udfremoveDuplicates = udf(removeDuplicates)
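Applied along these lines (the name df is an assumption here, standing in for the Dataset shown above):

import org.apache.spark.sql.functions.col

// add a new column holding the per-row deduplicated array
val withDeduped = df.withColumn("rm_duplicates", udfremoveDuplicates(col("products")))
withDeduped.show()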
This solution gives me proper results:
+-------------------+--------------------+--------------------+
| date| products| rm_duplicates|
+-------------------+--------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|[361, 362, 363, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|[361, 362, 363, 3...|
My questions are:
Does Spark provide a better/more efficient way of getting this result?
I was thinking about using a map, but how do I get the desired column as a List so I can use the 'distinct' method as in my removeDuplicates lambda? (See the sketch below.)
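On the first question: if you are on Spark 2.4 or later, the built-in array_distinct function covers this without a UDF; the map idea also works with a typed Dataset. A minimal sketch of both, assuming a SparkSession named spark and the two-column schema shown above:

import org.apache.spark.sql.functions.{array_distinct, col}

// Spark 2.4+: built-in array_distinct, no UDF required
val viaBuiltin = df.withColumn("rm_duplicates", array_distinct(col("products")))

// typed alternative: map over the Dataset and call Scala's distinct directly
import spark.implicits._
val viaMap = df
  .as[(java.sql.Timestamp, Seq[String])]
  .map { case (date, products) => (date, products.distinct) }
  .toDF("date", "products") // products is now deduplicated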
Edit: I tagged this question with java because it does not matter to me in which language (Scala or Java) I get an answer :)
Edit2: typos
Comment: add .toList after distinct and update your udf type annotation to return a List.
Comment: you could also store the products as a delimited string such as ":123:345:126:" and do a substring search for <delimiter><element><delimiter>. Complex data structures, even arrays, require lots more processing than strings.
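Following the first comment, a minimal sketch of the List-returning variant (df again stands in for the Dataset above):

import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

// return a List instead of a WrappedArray, as the comment suggests
val removeDuplicatesAsList: WrappedArray[String] => List[String] = _.distinct.toList
val udfRemoveDuplicatesAsList = udf(removeDuplicatesAsList)
val dedupedAsList = df.withColumn("rm_duplicates", udfRemoveDuplicatesAsList(col("products")))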