I've created an RDD from a file containing a list of URLs:

url_data = sc.textFile("url_list.txt")

Now I'm trying to build another RDD containing every row where 'net.com' is immediately preceded by a character that is neither a letter nor a digit. That is, include lines with .net.com or \tnet.com and exclude internet.com or cnet.com.

filtered_data = url_data.filter(lambda x: '[\W]net\.com' in x)

But this line returns no results. How can I make the PySpark shell work with regular expressions?

  • '[\W]net\.com' in '.net.com' returns False in Python, so it's a Python issue, not a PySpark issue. Commented Jun 14, 2016 at 16:52
  • Any clue what the correct regular expression would be? Commented Jun 14, 2016 at 17:01
  • \.+[a-zA-Z]+\.com looks like the regex command you want (test here: regexr.com). But it doesn't seamlessly integrate in python like you need. It looks like you may be able to use this in an SQL query (example here: stackoverflow.com/questions/34952985/…) Commented Jun 14, 2016 at 17:31
  • @lacerated: Did you try r"\Wnet\.com"? Does that also throw any errors? Actually, the regex [\W]net\.com works well in Python: ideone.com/kx8U3V Commented Jun 14, 2016 at 19:12
  • @WiktorStribiżew: Yep, I tried r"\Wnet\.com" too. I get no errors when I run an action like filtered_data.take(1), just an empty list []. If I try filtered_data.first(), it says: ValueError: RDD is empty. I wonder if I need to import some PySpark module to make regular expressions work in Spark, like import re in Python. If I just do filtered_data = url_data.filter(lambda x: 'net.com' in x), it filters fine, but I get extra lines I don't need. Commented Jun 14, 2016 at 22:31
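
The behaviour discussed in the comments can be reproduced without Spark. This is only a minimal sketch in plain Python: the `in` operator does a literal substring test, so the regex text itself is searched for in the line, while `re.search` interprets it as a pattern. The variable names are illustrative and do not come from the original post.

```python
import re

line = ".net.com"

# `in` performs a literal substring test: it looks for the characters
# "[\W]net\.com" themselves, which never occur in a URL line.
literal_hit = r"[\W]net\.com" in line

# re.search interprets the pattern and scans the whole line for a match.
regex_hit = re.search(r"[\W]net\.com", line) is not None

print(literal_hit, regex_hit)
```

This would explain why the filtered RDD comes back empty even though no error is raised: the lambda is valid Python, it just never returns True.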

1 Answer

Why not define a Python function that uses the re package (or re2, which is much faster) and returns a bool indicating whether there is a match?

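import re

def url_filter(url):
    # Replace REGEX_PATTERN with the actual expression.
    pattern = re.compile(r'REGEX_PATTERN')
    # search() finds the pattern anywhere in the line;
    # match() would only test the start of the line.
    return pattern.search(url) is not None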

Then just pass it to the filter function: url_data.filter(url_filter)
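
As a rough illustration of this approach applied to the question, here is a sketch that assumes the pattern r"\Wnet\.com" suggested in the comments. The sample lines are hypothetical stand-ins for the contents of url_list.txt, and the predicate is run on a plain Python list so it can be checked without a Spark context.

```python
import re

# Assumed pattern (from the comments): a non-word character right before "net.com".
# Note that \W also treats "_" as a word character, so "_net.com" is not matched.
NET_COM = re.compile(r"\Wnet\.com")

def url_filter(url):
    """Return True when the line contains net.com preceded by a non-word character."""
    return NET_COM.search(url) is not None

# Hypothetical sample lines standing in for url_list.txt.
sample = ["www.net.com", "\tnet.com", "internet.com", "cnet.com", "net.com"]
kept = [line for line in sample if url_filter(line)]
print(kept)

# With the RDD from the question, the same predicate would be passed directly:
# filtered_data = url_data.filter(url_filter)
```

A line that begins with a bare net.com is not kept here, because nothing precedes it. If such lines should be included, a pattern like r"(?:^|\W)net\.com" may be closer to what is wanted.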
