pyspark not working with regex

Question

I've made an RDD from a file containing a list of URLs:

url_data = sc.textFile("url_list.txt")

Now I'm trying to make another RDD with all rows that contain 'net.com' preceded by a character that is neither a letter nor a digit. That is, include lines with .net.com or \tnet.com and exclude internet.com or cnet.com.

filtered_data = url_data.filter(lambda x: '[\W]net\.com' in x)

But this line gives no results. How can I make the PySpark shell work with regex?

'[\W]net\.com' in '.net.com' returns False in Python, so this is a Python issue, not a PySpark issue.
– David, Commented Jun 14, 2016 at 16:52
\.+[a-zA-Z]+\.com looks like the regex you want (test here: regexr.com). But it doesn't integrate into Python as seamlessly as you need. It looks like you may be able to use this in an SQL query (example here: stackoverflow.com/questions/34952985/…)
– David, Commented Jun 14, 2016 at 17:31
@lacerated: Did you try r"\Wnet\.com"? Does that also throw any errors? Actually, the regex [\W]net\.com works well in Python: ideone.com/kx8U3V
– Wiktor Stribiżew, Commented Jun 14, 2016 at 19:12
@WiktorStribiżew: Yep, I tried r"\Wnet\.com" too. I get no errors when I execute an action like filtered_data.take(1), just an empty list []. If I try filtered_data.first() it says: ValueError: RDD is empty. I wonder if I need to import some PySpark module to make regular expressions work in Spark, like import re in Python. If I do just filtered_data = url_data.filter(lambda x: 'net.com' in x) it filters just fine, but I get extra lines I don't need.
– lacerated, Commented Jun 14, 2016 at 22:31

Mansweet · Accepted Answer · 2016-09-22 20:11:24Z

2

Why not define a Python function that uses the re or re2 (much faster) package and returns a bool indicating whether there is a match?

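import re

# Compile once, outside the function, so the pattern is reused for every row.
pattern = re.compile(r'REGEX_PATTERN')

def url_filter(url):
    # search() finds the pattern anywhere in the string;
    # match() would only succeed at the very start of it.
    return pattern.search(url) is not None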

Then just pass it to the filter function: url_data.filter(url_filter)
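To make this concrete for the pattern in the question, here is a minimal sketch that runs in plain Python without Spark. The name has_net_com and the sample lines are made up for illustration, and \Wnet\.com is just one reading of "preceded by a non-alphanumeric character" (note that \W also treats the underscore as a word character). It also shows why the original filter returned nothing: the in operator looks for the pattern text as a literal substring and never runs it as a regex.

```python
import re

# "net.com" preceded by a non-word character such as "." or a tab.
# Compiled once so every row reuses the same pattern object.
NET_COM = re.compile(r'\Wnet\.com')

def has_net_com(line):
    """Return True if the line contains net.com right after a non-word character."""
    return NET_COM.search(line) is not None

# Hypothetical sample rows standing in for the contents of url_list.txt.
sample_lines = [
    "http://www.net.com/page",
    "id42\tnet.com",
    "http://internet.com",
    "http://cnet.com/news",
]

# Same predicate an RDD filter would apply to each row.
kept = [line for line in sample_lines if has_net_com(line)]

# The original attempt: a literal substring test, not a regex match.
literal_test = r'[\W]net\.com' in '.net.com'
```

With the RDD from the question, the same predicate would presumably be passed as url_data.filter(has_net_com). A line that begins with net.com has no preceding character, so \W will not match it; if such lines should be kept, a pattern like (?:^|\W)net\.com is one option.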

