My Spark DataFrame (created from a Hive table) looks like this:
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|racist|filtered |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, 😍🔥, https://time.com/sxp3onz1w8] |
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay] |
and I am trying to remove the URLs from the filtered field.
I have tried:
val regex = "(https?\\://)\\S+".r
def removeRegex(input: Array[String]): Array[String] = {
  regex.replaceAllIn(input, "")
}
val removeRegexUDF = udf(removeRegex)
filteredDF.withColumn("noURL", removeRegexUDF('filtered)).show
which gives this error:
<console>:60: error: overloaded method value replaceAllIn with alternatives:
(target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
(target: CharSequence,replacement: String)String
cannot be applied to (Array[String], String)
regex.replaceAllIn(input, "")
^
I am very much a Scala newbie, so any guidance on how to handle the filtered array inside the UDF would be much appreciated. (Or, if there is a better way of doing this, I'm happy to hear it.)
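
From the error it looks like replaceAllIn wants a single CharSequence rather than a whole Array[String], so my (untested) guess is that the UDF should map the regex over each element instead, something along these lines, assuming Spark hands the array column to the UDF as a Seq[String]:

import org.apache.spark.sql.functions.udf

val regex = "(https?\\://)\\S+".r

// apply the regex to each element of the array rather than to the array itself;
// note: tokens that were pure URLs become empty strings here
val removeRegexUDF = udf { (input: Seq[String]) =>
  input.map(token => regex.replaceAllIn(token, ""))
}

filteredDF.withColumn("noURL", removeRegexUDF('filtered)).show(false)

Is that the right approach, or is there a built-in way to do this without a UDF?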