I have a Spark DataFrame that contains two array columns:
+------------------------------------------------------+-----------------+
|                                                  var1|             var2|
+------------------------------------------------------+-----------------+
|       [black tea, green tea, tea, yerba mate, oolong]|      [green tea]|
|[milk, toned milk, standardised milk, full cream milk]| [cow or buffalo]|
+------------------------------------------------------+-----------------+
I need to remove duplicates according to the following rules:
- Compare each element of the `var1` column with the value of the `var2` column and remove from `var1` every word that matches the `var2` value, either partially (for example, the single word `tea`) or completely (for example, both words of `green tea`).
- If there is a complete match and the element is removed entirely from the `var1` column, the extra comma (inside the array or at the end) must also be removed.
- Also remove repeating words from the elements of the `var1` column. If one element contains a word that is then repeated in other elements, these duplicates should be removed. For example, given `milk`, then `toned milk`, `standardised milk`, `full cream milk`, the desired output is `milk, toned, standardised, full cream`.
Required output:
+---------------------------------------+-----------------+
|                                   var1|             var2|
+---------------------------------------+-----------------+
|            [black, yerba mate, oolong]|      [green tea]|
|[milk, toned, standardised, full cream]| [cow or buffalo]|
+---------------------------------------+-----------------+
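
To make the rules concrete, here is a minimal plain-Python sketch of the per-row logic as I understand it. It is not a Spark solution in itself. The function name `clean_var1` is hypothetical, and it assumes that matching is by exact, case-sensitive, whitespace-separated words, that neither array is null, and that a word repeated inside a single element also counts as a duplicate.

```python
from typing import List


def clean_var1(var1: List[str], var2: List[str]) -> List[str]:
    """Apply the three rules to one row (hypothetical helper, not a Spark API)."""
    # Rule 1: every word of every var2 element is blocked from var1.
    blocked = {word for item in var2 for word in item.split()}
    seen = set()  # Rule 3: words already kept from an earlier position
    result = []
    for item in var1:
        kept = []
        for word in item.split():
            if word in blocked or word in seen:
                continue  # drop var2 matches and repeated words
            seen.add(word)
            kept.append(word)
        # Rule 2: an element that lost all of its words is dropped entirely,
        # so no empty string (and no stray comma) is left in the array.
        if kept:
            result.append(" ".join(kept))
    return result


row1 = clean_var1(
    ["black tea", "green tea", "tea", "yerba mate", "oolong"], ["green tea"]
)
row2 = clean_var1(
    ["milk", "toned milk", "standardised milk", "full cream milk"],
    ["cow or buffalo"],
)
print(row1)
print(row2)
```

For the two sample rows this yields the arrays shown in the required output. One way to apply it per row would be to wrap it with `pyspark.sql.functions.udf` and a return type of `ArrayType(StringType())`, though a solution built only from native column functions would probably perform better on large data.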