I have a large PySpark dataframe with well over 50,000 rows of data. One column contains each record's document text, and I am trying to run a regex search on it.
Below is the code that builds my regex pattern:
import re
words = {"other", "this","that"}
maxInter = 3 # maximum intermediate words between the target words
wordSpan = len(words)+maxInter*(len(words)-1)
anyWord = "|".join(words)
allWords = "".join(r"(?=(\w+\W*){0,SPAN}WORD\b)".replace("WORD",w)
                   for w in words)
allWords = allWords.replace("SPAN",str(wordSpan-1))
pattern = r"\bALL(\b(ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}"
pattern = pattern.replace("COUNT",str(len(words)))
pattern = pattern.replace("INTER",str(maxInter))
pattern = pattern.replace("ALL",allWords)
pattern = pattern.replace("ANY",anyWord)
print(pattern)
\b(?=(\w+\W*){0,8}that\b)(?=(\w+\W*){0,8}this\b)(?=(\w+\W*){0,8}other\b)(\b(that|this|other)(\W+\w+\W*){0,3}){3,3}
Below is what I'm trying to use to filter my pyspark dataframe, but something doesn't appear to be working right.
from pyspark.sql.functions import col
filtered = df.filter(col("attachment_text").rlike(pattern))
I've verified that this works on a regular list of strings and a pandas series, and while the above code runs (very quickly) without raising any errors, when I then try to get a simple row count (filtered.count()), my session just appears to sit there. It seems to be "working", but never seems to finish.
The fact that the filtering itself seems to move so fast on such a large dataset causes me to suspect that something might be wrong with that piece of the code, but I'm not certain. The large dataset should cause this to take longer, but the amount of time I'm waiting for a simple row count doesn't make sense.
Any ideas are appreciated!
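It is totally normal that filtered = df.filter(col("attachment_text").rlike(pattern)) is fast. It is a transformation, and transformations are lazy in Spark: they are only computed when an action runs, and count is an action.
Try replacing r"(?=(\w+\W*){0,SPAN}WORD\b)" with r"(?=(?:\w+\W+){0,SPAN}WORD\b)" and r"\bALL(\b(ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}" with r"\bALL(?:\b(?:ANY)(?:\W+\w+){0,INTER}){COUNT,COUNT}", i.e. replace \W* with \W+ (but remove it in \W+\w+\W*) and make all groups non-capturing (replace ( with (?:).
One caveat: with the trailing \W* removed entirely, nothing consumes the separator before the next target word, so the repeated group cannot match a second time. Moving it out of the inner repetition, as in (?:(?:\W+\w+){1,INTER}\W*)?, avoids that while accepting the same text as (\W+\w+\W*){0,INTER}.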
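A minimal sketch of the pattern builder with those two ideas applied (non-capturing groups, and \W+ instead of \W* in the lookaheads), presumably aimed at reducing the backtracking the regex engine has to do on long documents. One detail is an adjustment made for this sketch, not part of the suggestion as written: a single optional \W* is kept after the intermediate words, outside the inner repetition, so the next target word can still start at a word boundary. The sample strings and the names ordered, samples, hits_old and hits_new are made up for illustration.

```python
import re

words = {"other", "this", "that"}
maxInter = 3  # maximum intermediate words between the target words
wordSpan = len(words) + maxInter * (len(words) - 1)
ordered = sorted(words)  # fixed order so the generated pattern is reproducible

anyWord = "|".join(ordered)

# Lookaheads: non-capturing groups, and \W+ (not \W*) so one word
# cannot be split into several "words" in many different ways.
allWords = "".join(
    r"(?=(?:\w+\W+){0,SPAN}WORD\b)".replace("WORD", w) for w in ordered
).replace("SPAN", str(wordSpan - 1))

# Main part: a target word, then either nothing or 1..INTER intermediate
# words followed by optional trailing separators (assumption of this sketch).
pattern = (
    r"\bALL(?:\b(?:ANY)(?:(?:\W+\w+){1,INTER}\W*)?){COUNT,COUNT}"
    .replace("COUNT", str(len(words)))
    .replace("INTER", str(maxInter))
    .replace("ALL", allWords)
    .replace("ANY", anyWord)
)
print(pattern)

# The pattern printed in the question, kept only for comparison.
original = (
    r"\b(?=(\w+\W*){0,8}that\b)(?=(\w+\W*){0,8}this\b)(?=(\w+\W*){0,8}other\b)"
    r"(\b(that|this|other)(\W+\w+\W*){0,3}){3,3}"
)

# Hypothetical sample documents.
samples = [
    "this x that y other",
    "I think this is the other thing that matters",
    "this and that",
    "this a b c d that e other",
    "nothing here",
]

hits_old = [s for s in samples if re.search(original, s)]
hits_new = [s for s in samples if re.search(pattern, s)]
print(hits_new == hits_old)
```

The rewritten pattern uses only constructs that Java regular expressions also support (lookaheads, \b, \w, \W, non-capturing groups), so it can be passed to rlike the same way as before, for example df.filter(col("attachment_text").rlike(pattern)). Whether it makes count() finish in reasonable time on the real data still has to be measured.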