I have a Spark dataframe that contains 2 array columns:

+------------------------------------------------------+-----------------+
|                                                  var1|             var2|
+------------------------------------------------------+-----------------+
|       [black tea, green tea, tea, yerba mate, oolong]|      [green tea]|
|[milk, toned milk, standardised milk, full cream milk]| [cow or buffalo]|
+------------------------------------------------------+-----------------+

I need to remove duplicates according to the following rules:

  1. Compare each element of var1 with the value in var2 and remove from var1 any words that match it partially (a single word, e.g. tea) or completely (both words, e.g. green tea).
  2. If an element matches completely and is therefore removed from var1 entirely, the leftover comma (inside the array or at its end) must be removed as well.
  3. Also remove repeated words across the elements of var1.
    For example, if a word in one element is repeated in other elements, the repeats should be removed. Given milk, toned milk, standardised milk, full cream milk, the desired output is: milk, toned, standardised, full cream.

Required output:

+---------------------------------------+-----------------+
|                                   var1|             var2|
+---------------------------------------+-----------------+
|            [black, yerba mate, oolong]|      [green tea]|
|[milk, toned, standardised, full cream]| [cow or buffalo]|
+---------------------------------------+-----------------+

2 Answers

Here's one way using array higher-order functions:

  1. Flatten array var2 into an array of single words, then use transform on array var1 to remove every word that matches one of the words in var2. Finally, filter out the empty-string elements.
  2. Join array var1 and remove duplicate words with a regex, then split again to get an array.

from pyspark.sql import functions as F

df1 = df.withColumn(
    "regex",
    F.concat_ws("|", F.flatten(F.transform("var2", lambda x: F.split(x, "\\s+"))))
).withColumn(
    "var1",
    F.filter(F.expr("transform(var1, x -> regexp_replace(x, regex, ''))"), lambda x: F.trim(x) != "")
).withColumn(
    "var1",
    F.regexp_replace(F.array_join(F.reverse("var1"), "#"), r"\b(\w+)\b(?=.*\b\1\b)", "")
).withColumn(
    "var1",
    F.transform(F.reverse(F.split("var1", "#")), lambda x: F.trim(x))
).drop("regex")
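To see why the join/regex/split trick in step 2 keeps the first occurrence of each word, here is a minimal plain-Python trace of the same idea, using the `re` module instead of Spark. The helper name `dedupe_words` is made up for illustration, and the sketch assumes the elements never contain the `#` separator.

```python
import re

# Same pattern as in the Spark snippet: match a word only if the same
# word appears again somewhere later in the string.
DUP_WORD = re.compile(r"\b(\w+)\b(?=.*\b\1\b)")

def dedupe_words(items):
    """Hypothetical helper mirroring step 2 on a plain Python list."""
    # Reverse first, so that "later in the string" means "earlier in the
    # original array"; the earliest occurrence is then the one that survives.
    joined = "#".join(reversed(items))
    cleaned = DUP_WORD.sub("", joined)
    # Split back, restore the original order and trim leftover spaces.
    return [part.strip() for part in reversed(cleaned.split("#"))]

print(dedupe_words(["milk", "toned milk", "standardised milk", "full cream milk"]))
# ['milk', 'toned', 'standardised', 'full cream']
```

Without the reversal the lookahead would keep the last occurrence instead, which would give `['', 'toned', 'standardised', 'full cream milk']` for the same input. Note that this sketch does not drop an element that becomes empty at this stage.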

Using this example df:

df = spark.createDataFrame([
    (["black tea", "green tea", "tea", "yerba mate", "oolong"], ["green tea"]),
    (["milk", "toned milk", "standardised milk", "full cream milk"], ["cow or buffalo"])
], ["var1", "var2"])

You get

df1.show(truncate=False)
# +---------------------------------------+----------------+
# |var1                                   |var2            |
# +---------------------------------------+----------------+
# |[black, yerba mate, oolong]            |[green tea]     |
# |[milk, toned, standardised, full cream]|[cow or buffalo]|
# +---------------------------------------+----------------+
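
One caveat worth checking on real data: the `regex` column is a plain alternation such as `green|tea`, so it also matches inside longer words. A small plain-Python illustration follows (the `steamed milk` input is made up, not taken from the question), with a word-boundary variant as one possible fix:

```python
import re

words = ["green", "tea"]               # var2 split into single words
plain = "|".join(words)                # what the `regex` column holds: green|tea
bounded = r"\b(?:" + plain + r")\b"    # assumed variant: match whole words only

print(re.sub(plain, "", "steamed milk"))    # 'smed milk'  ("tea" inside "steamed")
print(re.sub(bounded, "", "steamed milk"))  # 'steamed milk'
print(re.sub(bounded, "", "black tea"))     # 'black '
```

In the Spark snippet this would presumably mean wrapping the joined words in `\b(?:` and `)\b` when building the `regex` column; I have not run that variant against the example df.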

Definitely a savanna buffalo not cow :-))

from pyspark.sql.functions import array_except, col, regexp_replace, split

df = (
  # Split var1 and var2 into single words held in a list and store them in temp columns (var1_1, var2_1)
  df.select('*', *[split(regexp_replace(col(x).cast('string'), r'\]|\[|\,', ''), r'\s').alias(f'{x}_1') for x in df.columns])
     # Use the array functions to remove from var1 the elements that exist in var2
     .withColumn('var1', array_except('var1', 'var2_1'))
     .withColumn('var1', array_except('var1', 'var2'))
).select('var1', 'var2')


df.show(truncate=False)



+------------------------------------------------------+----------------+
|var1                                                  |var2            |
+------------------------------------------------------+----------------+
|[black, yerba mate, oolong]                           |[green tea]     |
|[milk, toned milk, standardised milk, full cream milk]|[cow or buffalo]|
+------------------------------------------------------+----------------+
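
A note for anyone trying to reproduce this: as far as I understand it, `array_except` compares whole array elements, so it can only drop an element of var1 that is exactly equal to an element of the other array. The sketch below emulates the two `array_except` calls in plain Python; the function here is a stand-in written for illustration, not the Spark API itself, and the result is what the emulation gives, not output captured from Spark.

```python
def array_except(left, right):
    """Plain-Python stand-in for Spark's array_except (an emulation only):
    keep elements of `left` that are not in `right`, without duplicates."""
    seen, out = set(right), []
    for item in left:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

var1 = ["black tea", "green tea", "tea", "yerba mate", "oolong"]
var2 = ["green tea"]
var2_words = [w for item in var2 for w in item.split()]  # plays the role of var2_1

step1 = array_except(var1, var2_words)  # drops only the element that is exactly "tea"
step2 = array_except(step1, var2)       # drops the element that is exactly "green tea"
print(step2)
# ['black tea', 'yerba mate', 'oolong']
```

Under that assumption, `black tea` keeps its `tea`, and the milk row is left unchanged because none of its elements equals `cow`, `or`, `buffalo` or `cow or buffalo`.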

4 Comments

Thank you so much for the answer!:)) This is very close to the desired output, but is it possible to preserve composite items such as yerba mate and full cream here (I mean, do not separate them with a comma, as in many cases some meaning will be lost)?
See my edit, just except twice
Sorry, but after the last edits, my input column var1 does not transform at all (I have [black tea, green tea, tea, white tea, yerba mate, oolong]). That is, I can't get the output you got, although getting such an output would be perfect
let me look into that
