
I have a DataFrame column holding an array of strings, where each element is a (key,value) pair, as below.

ColA
[(1,2),(1,3),(1,4),(2,3)]

I have to remove duplicate keys, keeping the pair with the minimum value for each key, and I don't want to explode the array to do it. Each key should be unique in the result. In the column above there are three pairs with key 1, so (1,2) should be picked, since 2 is the minimum value among (1,2),(1,3),(1,4).

Output should be: ColA [(1,2),(2,3)]

I started writing a UDF like this:

val removeDup = udf((arr: Seq[String]) => {
  arr.map(x => x.split(","))
})

I cannot use reduceByKey, since this is a DataFrame/Dataset.

  • Can you clearly state the input dataframe column? Commented Jul 25, 2017 at 3:59
  • It has only one column in a dataframe Commented Jul 25, 2017 at 4:28
  • How do you get ColA (1,2),(2,3) if you take minimum value ? Commented Jul 25, 2017 at 4:58
  • Edited with more explanation Commented Jul 25, 2017 at 5:18
  • @Deek can you provide sample input data for the dataframe you have? Commented Jul 25, 2017 at 5:21

2 Answers


Okay, so provided that the column is of type String and not of type Seq[String], the code below should give you what you want:

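import org.apache.spark.sql.functions.udf

val removeDup = udf((str: String) => {
  // Split on parentheses, then drop the "," separators and empty fragments
  str.split("\\(|\\)").filter(s => s != "," && s != "").map(s => {
    val array = s.split(",")
    (array(0), array(1))
  })
  .groupBy(_._1)                        // group the pairs by key
  .mapValues(a => a.minBy(_._2.toInt))  // compare values numerically, so "10" does not sort before "2"
  .values
  .toSeq
  .sortBy(_._1)
})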

On your example:

import spark.implicits._  // needed for toDF and the 'colA column syntax

val df = spark.sparkContext.parallelize(Seq("(1,2),(1,3),(1,4),(2,3)")).toDF("colA")

df.select(removeDup('colA)).show

This yields:

+--------------------+
|           UDF(colA)|
+--------------------+
|      [[1,2], [2,3]]|
+--------------------+

If you wish to keep the column type as String, you would need to add .mkString(",") to the udf.
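
If the column really is an array of strings (Seq[String]) rather than a single String, the same group-and-take-the-minimum idea can be sketched as a plain Scala function. This is only a sketch under assumptions: dedupeByMinValue is a hypothetical name, each element is assumed to look like "(k,v)", and v is assumed to parse as an integer.

```scala
// Hypothetical helper (not part of the answer above): for each key, keep
// the pair with the smallest value.
// Assumptions: every element looks like "(k,v)" and v parses as an Int.
def dedupeByMinValue(pairs: Seq[String]): Seq[String] =
  pairs
    .map { s =>
      // "(1,2)" -> Array("1", "2")
      val Array(k, v) = s.trim.stripPrefix("(").stripSuffix(")").split(",").map(_.trim)
      (k, v.toInt)
    }
    .groupBy(_._1)                      // key -> all pairs sharing that key
    .values
    .map(_.minBy(_._2))                 // smallest value per key
    .toSeq
    .sortBy(_._1)                       // stable order; keys are compared as strings
    .map { case (k, v) => s"($k,$v)" }  // back to the "(k,v)" string form

val result = dedupeByMinValue(Seq("(1,2)", "(1,3)", "(1,4)", "(2,3)"))
println(result)
```

In Spark, this could presumably be wrapped with udf(dedupeByMinValue _) and applied to the array column directly, which avoids the explode.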

df.select("v1", "v2").groupBy("v1").min("v2").show

1 Comment

It's a single column with an array of strings. This might not help!
