Remove duplicates from PySpark array column by checking each element

Question

I have a Spark dataframe that contains 2 array columns:

+------------------------------------------------------+-----------------+
|                                                  var1|             var2|
+------------------------------------------------------+-----------------+
|       [black tea, green tea, tea, yerba mate, oolong]|      [green tea]|
|[milk, toned milk, standardised milk, full cream milk]| [cow or buffalo]|
+------------------------------------------------------+-----------------+

I need to remove duplicates according to the following rules:

Check each element of var1 against the value of var2 and remove from var1 the words that match the var2 value partially (for example, one word: tea) or completely (for example, both words: green tea).
If an element matches completely and is removed from var1, the leftover comma (inside the array or at the end) must be removed as well.
Also remove repeated words across the elements of var1.
For example, if one element contains a word that is repeated in other elements, the repeats should be removed (for example, given milk, toned milk, standardised milk, full cream milk, the desired output is: milk, toned, standardised, full cream).

Required output:

+---------------------------------------+-----------------+
|                                   var1|             var2|
+---------------------------------------+-----------------+
|            [black, yerba mate, oolong]|      [green tea]|
|[milk, toned, standardised, full cream]| [cow or buffalo]|
+---------------------------------------+-----------------+

blackbishop · Accepted Answer · 2022-08-01 13:21:47Z

Here's one way using higher-order array functions:

Flatten array var2 into an array of single words, then use transform on array var1 to remove each word that matches one of the words in var2. Finally, filter out the empty string elements.
Join array var1 and remove duplicate words using a regex, then split again to get an array. The array is reversed before joining (and again after splitting) so that the lookahead regex keeps the first occurrence of each word.

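from pyspark.sql import functions as F

df1 = df.withColumn(
    # Build an alternation pattern such as "green|tea" from the single words in var2
    "regex",
    F.concat_ws("|", F.flatten(F.transform("var2", lambda x: F.split(x, "\\s+"))))
).withColumn(
    # Delete the var2 words from every element of var1, then drop elements that are now blank
    "var1",
    F.filter(F.expr("transform(var1, x -> regexp_replace(x, regex, ''))"), lambda x: F.trim(x) != "")
).withColumn(
    # Reverse and join with "#", then delete every word that occurs again later in the string
    "var1",
    F.regexp_replace(F.array_join(F.reverse("var1"), "#"), r"\b(\w+)\b(?=.*\b\1\b)", "")
).withColumn(
    # Split back into an array, restore the original order and trim leftover spaces
    "var1",
    F.transform(F.reverse(F.split("var1", "#")), lambda x: F.trim(x))
).drop("regex")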

Using this example df:

df = spark.createDataFrame([
    (["black tea", "green tea", "tea", "yerba mate", "oolong"], ["green tea"]),
    (["milk", "toned milk", "standardised milk", "full cream milk"], ["cow or buffalo"])
], ["var1", "var2"])

You get

df1.show(truncate=False)
# +---------------------------------------+----------------+
# |var1                                   |var2            |
# +---------------------------------------+----------------+
# |[black, yerba mate, oolong]            |[green tea]     |
# |[milk, toned, standardised, full cream]|[cow or buffalo]|
# +---------------------------------------+----------------+
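To make the two regex steps easier to follow without a Spark session, here is a minimal plain-Python sketch of the same idea. The helper names strip_var2_words and dedupe_words are made up for illustration, and re.escape is an addition that the Spark version does not have. Python's re and the Java regex engine used by Spark can differ in details such as Unicode word boundaries, so treat this as an approximation rather than an exact port.

```python
import re

# Matches a whole word that appears again somewhere later in the string.
DUP_WORD = re.compile(r"\b(\w+)\b(?=.*\b\1\b)")


def strip_var2_words(var1, var2):
    """Delete every var2 word from each var1 element and drop blank results."""
    # Flatten var2 into single words, e.g. ["green tea"] -> ["green", "tea"].
    words = [w for phrase in var2 for w in phrase.split()]
    # Alternation such as "green|tea"; re.escape is an assumption added here.
    pattern = re.compile("|".join(re.escape(w) for w in words))
    stripped = [pattern.sub("", item) for item in var1]
    return [item for item in stripped if item.strip()]


def dedupe_words(items):
    """Keep only the first occurrence of each word across all elements."""
    # Reversing first means the lookahead deletes the later duplicates
    # (in original order) and leaves the first occurrence in place.
    joined = "#".join(reversed(items))
    cleaned = DUP_WORD.sub("", joined)
    return [part.strip() for part in reversed(cleaned.split("#"))]


row1 = strip_var2_words(
    ["black tea", "green tea", "tea", "yerba mate", "oolong"], ["green tea"]
)
print(dedupe_words(row1))
# ['black', 'yerba mate', 'oolong']

row2 = strip_var2_words(
    ["milk", "toned milk", "standardised milk", "full cream milk"],
    ["cow or buffalo"],
)
print(dedupe_words(row2))
# ['milk', 'toned', 'standardised', 'full cream']
```

One thing the sketch makes visible: the alternation has no word boundaries, so a short var2 word such as "or" would also be deleted from inside longer words. If that matters for your data, wrapping each word in \b is one possible adjustment.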

wwnde · Accepted Answer · 2022-08-01 09:29:51Z

Definitely a savanna buffalo not cow :-))

from pyspark.sql.functions import array_except, col, regexp_replace, split

df = (
    # Split var1 and var2 into single words and store them in temp columns var1_1 and var2_1
    df.select('*', *[split(regexp_replace(col(x).cast('string'), r'\]|\[|\,', ''), r'\s').alias(f'{x}_1') for x in df.columns])
    # Use array_except to remove from var1 the elements that also appear
    # in var2_1 (single words) or in var2 (whole phrases)
    .withColumn('var1', array_except('var1', 'var2_1'))
    .withColumn('var1', array_except('var1', 'var2'))
).select('var1', 'var2')

df.show(truncate=False)

+------------------------------------------------------+----------------+
|var1                                                  |var2            |
+------------------------------------------------------+----------------+
|[black, yerba mate, oolong]                           |[green tea]     |
|[milk, toned milk, standardised milk, full cream milk]|[cow or buffalo]|
+------------------------------------------------------+----------------+
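For readers comparing the two approaches, here is a small plain-Python sketch of a whole-element array difference. It assumes array_except compares complete elements and drops duplicates, as the PySpark documentation describes. The function name array_except_sketch is hypothetical, and the element order it returns is not a claim about Spark's ordering.

```python
def array_except_sketch(left, right):
    """Return elements of `left` that equal no element of `right`, without duplicates."""
    excluded = set(right)
    seen = set()
    result = []
    for item in left:
        # Only exact, whole-element matches are excluded.
        if item not in excluded and item not in seen:
            seen.add(item)
            result.append(item)
    return result


var1 = ["black tea", "green tea", "tea", "yerba mate", "oolong"]

# First pass: exclude the single words taken from var2.
step1 = array_except_sketch(var1, ["green", "tea"])
# Second pass: exclude the whole var2 phrase.
step2 = array_except_sketch(step1, ["green tea"])
print(step2)
# ['black tea', 'yerba mate', 'oolong']
```

Under that assumption an element such as "black tea" is left intact, because neither "green" nor "tea" equals the whole element. Removing a word from inside an element would need a per-element replacement such as regexp_replace.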

edited Aug 1, 2022 at 9:29

answered Aug 1, 2022 at 7:25

4 Comments

Hilary Over a year ago

Thank you so much for the answer!:)) This is very close to the desired output, but is it possible to preserve composite items such as yerba mate and full cream here (I mean, do not separate them with a comma, as in many cases some meaning will be lost)?

wwnde Over a year ago

See my edit, just except twice

Hilary Over a year ago

Sorry, but after the last edits, my input column var1 does not transform at all (I have [black tea, green tea, tea, white tea, yerba mate, oolong]). That is, I can't get the output you got, although getting such an output would be perfect

wwnde Over a year ago

let me look into that
