How do I use string array as parameter in Scala udf?

Question

My Spark dataframe (created from a Hive table) looks like:

+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|racist|filtered                                                                                                                                                      |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, 😍🔥, https://time.com/sxp3onz1w8]                                                                      |
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]                                                                                |

and I am trying to remove URLs from the filtered field.

I have tried:

val regex = "(https?\\://)\\S+".r

def removeRegex( input: Array[String] ) : Array[String]  = {
    regex.replaceAllIn(input, "")
}

val removeRegexUDF = udf(removeRegex)

filteredDF.withColumn("noURL", removeRegexUDF('filtered)).show

which gives this error:

<console>:60: error: overloaded method value replaceAllIn with alternatives:
  (target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
  (target: CharSequence,replacement: String)String
 cannot be applied to (Array[String], String)
           regex.replaceAllIn(input, "")
                 ^

I am very much a newbie at Scala so any guidance you can give on how to handle the filtered array in the udf is much appreciated. (Or if there is a better way of doing this I'm happy to hear it).

Your input is an Array of Strings, but the method expects just a single string in which every occurrence of the regex is replaced.
– Secespitus, Commented Jun 30, 2017 at 10:51

Raphael Roth · Accepted Answer · 2017-06-30 11:52:55Z

3

I would not replace the URLs with empty strings but rather remove them. This UDF will do the trick:

val removeRegexUDF = udf(
  (input: Seq[String]) => input.filterNot(s => s.matches("(https?\\://)\\S+"))
)
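To see what this filter does without starting Spark, here is a minimal sketch that applies the same logic to a plain Seq. The names dropUrls, urlPattern and sampleTokens are made up for illustration; the tokens are copied from the second sample row in the question.

```scala
// Same pattern as in the answer. String.matches only succeeds when the
// whole string matches, so a token is dropped only if it is entirely a URL.
val urlPattern = "(https?\\://)\\S+"

// Hypothetical helper holding the UDF's logic as a plain function.
def dropUrls(tokens: Seq[String]): Seq[String] =
  tokens.filterNot(token => token.matches(urlPattern))

// Tokens from the second sample row in the question (note the empty string).
val sampleTokens = Seq("rt", "@axolrose:", "yall", "call", "kermit", "frog",
  "lizard?", "", "https://time.com/wdaeaer1ay")

val withoutUrls = dropUrls(sampleTokens)
```

The URL element disappears from the result, while the empty string that was already in the row stays, because it does not match the pattern.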

answered Jun 30, 2017 at 11:52

Raphael Roth


4 Comments

schoon Over a year ago

Thanks that did it!

schoon Over a year ago

Can I add an OR in the s.matches bit so it is removed if it matches (URL OR something else)?

Raphael Roth Over a year ago

@schoon Of course, I would do it like this: filterNot(s => s.matches(regex1) || s.matches(regex2))
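A small sketch of that OR condition on a plain Seq, assuming the "something else" is an @mention. The mentionPattern and the helper name dropUrlsAndMentions are hypothetical and not from the thread.

```scala
val urlPattern = "(https?\\://)\\S+"
// Hypothetical second pattern: a token that starts with '@'.
val mentionPattern = "@\\S+"

// Drop a token if it fully matches either pattern.
def dropUrlsAndMentions(tokens: Seq[String]): Seq[String] =
  tokens.filterNot(s => s.matches(urlPattern) || s.matches(mentionPattern))

// The same condition written as one pattern using regex alternation.
val combinedPattern = "(?:" + urlPattern + ")|(?:" + mentionPattern + ")"

val tokens = Seq("rt", "@axolrose:", "frog", "https://time.com/wdaeaer1ay")
val kept = dropUrlsAndMentions(tokens)
val keptCombined = tokens.filterNot(_.matches(combinedPattern))
```

Both variants keep the same tokens; the single combined pattern may be easier to maintain when the list of patterns grows.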

T. Gawęda Over a year ago

Nice catch :) Non-standard thinking is helpful :)

T. Gawęda · Accepted Answer · 2017-06-30 11:40:19Z

1

Yes, you can.

First, the parameter type should be Seq (or WrappedArray) instead of Array, because Spark passes an array column to a Scala UDF as a Seq. Second, replaceAllIn turns one string into another string; it does not operate on a collection.

Your UDF should be:

def removeRegex(input: Seq[String]) : Array[String]  = {
    input.map(x => regex.replaceAllIn(x, "")).toArray
}

So map over the elements, applying the regular expression to each one.
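A minimal sketch of the map-based approach on a plain Seq, so it can be tried without Spark. It returns Seq rather than Array only to keep the comparison simple, and the names removeRegexFn and cleaned are made up for illustration. It also shows the method-to-function conversion that the "missing argument list" compiler message quoted in the comments asks for.

```scala
import scala.util.matching.Regex

// Same pattern as in the question.
val regex: Regex = "(https?\\://)\\S+".r

// Apply the regex to each element; the collection keeps its length.
def removeRegex(input: Seq[String]): Seq[String] =
  input.map(x => regex.replaceAllIn(x, ""))

// A method is not a function value. Giving the val a function type makes the
// compiler convert it; with Spark in scope, udf(removeRegex _) is the
// explicit form that the compiler message suggests.
val removeRegexFn: Seq[String] => Seq[String] = removeRegex

val cleaned = removeRegexFn(Seq("kermit", "frog", "https://time.com/wdaeaer1ay"))
```

Note that the URL is replaced by an empty string instead of being removed, so the array keeps an empty element where the URL used to be.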

You can also use the regexp_replace function from Spark's built-in functions.

edited Jun 30, 2017 at 11:40

answered Jun 30, 2017 at 11:17

T. Gawęda


7 Comments

schoon Over a year ago

Thanks. That gave me this error: <console>:61: error: type mismatch; found : Seq[String] required: Array[String] input.map(regex.replaceAllIn(_, ""))

T. Gawęda Over a year ago

@schoon Are you using Scala Regexp?

schoon Over a year ago

@T. Gawęda Er, I think maybe not. I should be importing it, right? But I am not.

T. Gawęda Over a year ago

@schoon You don't have to. See my answer now, I've changed the code.

schoon Over a year ago

Now I get: <console>:64: error: missing argument list for method removeRegex Unapplied methods are only converted to functions when a function type is expected. You can make this conversion explicit by writing removeRegex _ or removeRegex(_) instead of removeRegex. val removeRegexUDF = udf(removeRegex
