1

My Spark dataframe (created from a Hive table) looks like:

+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|racist|filtered                                                                                                                                                      |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, 😍🔥, https://time.com/sxp3onz1w8]                                                                      |
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]                                                                                |

and I am trying to remove urls from the filtered field.

I have tried:

val regex = "(https?\\://)\\S+".r

def removeRegex( input: Array[String] ) : Array[String]  = {
    regex.replaceAllIn(input, "")
}

val removeRegexUDF = udf(removeRegex)

filteredDF.withColumn("noURL", removeRegexUDF('filtered)).show

which gives this error:

<console>:60: error: overloaded method value replaceAllIn with alternatives:
  (target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
  (target: CharSequence,replacement: String)String
 cannot be applied to (Array[String], String)
           regex.replaceAllIn(input, "")
                 ^

I am very much a newbie at Scala so any guidance you can give on how to handle the filtered array in the udf is much appreciated. (Or if there is a better way of doing this I'm happy to hear it).

2
  • Your input is an Array of Strings, but the method expects a single string in which every occurrence of the regex is replaced. Commented Jun 30, 2017 at 10:51
  • This is not really related to Spark; it is a pure Scala issue. Commented Jun 30, 2017 at 11:45

2 Answers

3

I would not replace the URLs with empty strings but rather remove them. This UDF will do the trick:

val removeRegexUDF = udf(
  (input: Seq[String]) => input.filterNot(s => s.matches("(https?\\://)\\S+"))
)
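To see what the lambda inside this UDF does without starting Spark, here is a minimal plain-Scala sketch of the same filtering logic. The object and method names (FilterUrlsSketch, dropUrls) are made up for illustration, and the sample tokens are taken from the question's second row.

```scala
// Plain-Scala sketch of the filtering logic used inside the UDF (no Spark required).
// FilterUrlsSketch and dropUrls are hypothetical names, not part of the answer.
object FilterUrlsSketch {
  // Pattern from the answer. String.matches only succeeds when the WHOLE
  // token matches, so a token that merely contains a URL is kept.
  val urlPattern = "(https?\\://)\\S+"

  // Drop every token that is a URL; all other tokens are left untouched.
  def dropUrls(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(_.matches(urlPattern))

  def main(args: Array[String]): Unit = {
    val tokens = Seq("rt", "@axolrose:", "frog", "lizard?", "", "https://time.com/wdaeaer1ay")
    println(dropUrls(tokens))
  }
}
```

Note that the empty-string token already present in the data is not removed by this filter; only tokens that match the URL pattern are dropped.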

4 Comments

Thanks that did it!
Can I add an OR in the s.matches bit so it is removed if it matches (URL OR something else)?
@schoon Of course, I would do it like this: filterNot(s => s.matches(regex1) || s.matches(regex2))
Nice catch :) Non-standard thinking is helpful :)
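The "URL OR something else" idea from the comments can be sketched in plain Scala as follows. The second pattern (an @mention regex) is only an assumed stand-in for "something else", and the names FilterManySketch and dropMatching are hypothetical.

```scala
// Sketch of filtering on several patterns at once (no Spark required).
object FilterManySketch {
  // The URL regex comes from the answer; the @mention regex is an assumed
  // example of a second pattern and is not part of the original thread.
  val patterns: Seq[String] = Seq("(https?\\://)\\S+", "@\\S+")

  // A token is dropped if it fully matches ANY of the patterns, which is the
  // same as chaining s.matches(regex1) || s.matches(regex2).
  def dropMatching(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(s => patterns.exists(p => s.matches(p)))
}
```

Keeping the patterns in a Seq means a third or fourth pattern can be added without touching the filter itself.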
1

Yes, you can.

First, the parameter type should be Seq (or WrappedArray) rather than Array, because Spark passes an array column to a UDF as a Seq. Second, replaceAllIn turns one string into another string; it does not operate on a collection.

Your UDF should be:

def removeRegex(input: Seq[String]) : Array[String]  = {
    input.map(x => regex.replaceAllIn(x, "")).toArray
}

So map over the elements and apply the regular expression to each one.
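The mapping approach can be tried outside Spark with a small plain-Scala sketch. It assumes the regex value from the question; ReplaceUrlsSketch and removeRegexFn are hypothetical names. It also shows the explicit method-to-function conversion (removeRegex _) that the compiler asks for when a method defined with def is passed where a function value is expected, as in udf(removeRegex _).

```scala
import scala.util.matching.Regex

// Plain-Scala sketch of the per-element replacement (no Spark required).
object ReplaceUrlsSketch {
  // Same pattern as in the question.
  val regex: Regex = "(https?\\://)\\S+".r

  // Replace URLs inside each element. A token that is only a URL becomes
  // an empty string; it is not removed from the collection.
  def removeRegex(input: Seq[String]): Seq[String] =
    input.map(x => regex.replaceAllIn(x, ""))

  // A method is not a function value. Writing `removeRegex _` converts it
  // explicitly, which is the form needed for a call like udf(removeRegex _).
  val removeRegexFn: Seq[String] => Seq[String] = removeRegex _
}
```

Unlike the filterNot approach, this keeps the array length unchanged and leaves empty strings where the URLs were.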

You can also use the regexp_replace function from Spark's built-in functions.

7 Comments

Thanks. That gave me this error: <console>:61: error: type mismatch; found : Seq[String] required: Array[String] input.map(regex.replaceAllIn(_, ""))
@schoon Are you using Scala Regexp?
@ T. er i think maybe not. I should be importing it right? But i am not.
@schoon You don't have to. See now my answer, I've changed code
Now I get: <console>:64: error: missing argument list for method removeRegex Unapplied methods are only converted to functions when a function type is expected. You can make this conversion explicit by writing removeRegex _ or removeRegex(_) instead of removeRegex. val removeRegexUDF = udf(removeRegex