I've made an RDD from a file containing a list of URLs:
url_data = sc.textFile("url_list.txt")
Now I'm trying to make another RDD containing all rows where 'net.com' appears preceded by a non-alphanumeric character. That is, I want to include lines with '.net.com' or '\tnet.com' and exclude 'internet.com' or 'cnet.com'.
filtered_data = url_data.filter(lambda x: '[\W]net\.com' in x)
But this line gives no results. How can I make the PySpark shell work with regex?
Comments:

'[\W]net\.com' in '.net.com' returns False in Python. So it's a Python issue, not a PySpark issue.

r"\Wnet\.com"? Does that also throw any errors?

Actually, the regex [\W]net\.com works well in Python: ideone.com/kx8U3Vr. "\Wnet\.com" too.

I get no errors when I try to execute an action like filtered_data.take(1), just an empty list []. If I try filtered_data.first() it says: ValueError: RDD is empty. I wonder if I need to import some PySpark module to make regular expressions work in Spark, like import re in Python. If I do just filtered_data = url_data.filter(lambda x: 'net.com' in x), it filters just fine, but I get extra lines I don't need.
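As the comments point out, the root cause is Python's in operator: it performs a literal substring test and never treats its left operand as a regex, so the filter matches nothing. A minimal sketch of a working filter, reusing url_data from the question (the (^|\W) alternation is my assumption so that lines beginning with net.com also match; use just \W if a preceding non-word character is strictly required):

import re

# 'pattern' in x only checks for the literal substring '[\W]net\.com',
# so the regex is never evaluated. re.search applies it to each line.
# \. matches a literal dot; (^|\W) matches at line start or after a
# non-word character (assumption: start-of-line should also match).
filtered_data = url_data.filter(
    lambda x: re.search(r'(^|\W)net\.com', x) is not None
)

No extra PySpark module is needed: the standard import re in the driver script is enough, since Spark ships the lambda's closure to the workers and re is part of the standard library available there.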