I have a large PySpark DataFrame with well over 50,000 rows. One column contains each record's document text, on which I am attempting to perform a regex search.

Below is the code that builds up my regex, along with the resulting pattern:

import re

words    = {"other", "this","that"}
maxInter = 3 # maximum intermediate words between the target words

wordSpan = len(words)+maxInter*(len(words)-1)

anyWord  = "|".join(words)
allWords = "".join(r"(?=(\w+\W*){0,SPAN}WORD\b)".replace("WORD",w) 
                    for w in words)
allWords = allWords.replace("SPAN",str(wordSpan-1))
                    
pattern = r"\bALL(\b(ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}"
pattern = pattern.replace("COUNT",str(len(words)))
pattern = pattern.replace("INTER",str(maxInter))
pattern = pattern.replace("ALL",allWords)
pattern = pattern.replace("ANY",anyWord)

print(pattern)

\b(?=(\w+\W*){0,8}that\b)(?=(\w+\W*){0,8}this\b)(?=(\w+\W*){0,8}other\b)(\b(that|this|other)(\W+\w+\W*){0,3}){3,3}

Below is what I'm using to filter the PySpark DataFrame, but something doesn't appear to be working right.

from pyspark.sql.functions import col

filtered = df.filter(col("attachment_text").rlike(pattern))

I've verified that the pattern works on a regular list of strings and on a pandas Series. The code above runs (very quickly) without raising any errors, but when I then try to get a simple row count with filtered.count(), my session just sits there. It seems to be "working", yet it never finishes.
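
For reference, the sanity check outside of Spark was along these lines (the sample strings here are made up purely for illustration):

import re

# the exact pattern printed above
pattern = r"\b(?=(\w+\W*){0,8}that\b)(?=(\w+\W*){0,8}this\b)(?=(\w+\W*){0,8}other\b)(\b(that|this|other)(\W+\w+\W*){0,3}){3,3}"

# made-up examples: the first should match, the second should not
samples = [
    "this sentence has that word and some other words nearby",
    "this text mentions that but nothing else of interest here",
]

compiled = re.compile(pattern)
for s in samples:
    print(bool(compiled.search(s)), "-", s)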

The fact that the filter itself runs so fast on such a large dataset makes me suspect that something is wrong with that piece of the code, but I'm not certain. A dataset this size should take longer to filter, and the amount of time I'm waiting for a simple row count doesn't make sense.

Any ideas are appreciated!

  • Did you try a piece of your data and your current regex on a website like regex101.com, just to check whether it matches anything? Commented Aug 27, 2021 at 15:22
  • If you mean that this line, filtered = df.filter(col("attachment_text").rlike(pattern)), is fast, that is totally normal. It is a transformation, and transformations are lazy in Spark; they are only computed when an action runs, and count is an action. Commented Aug 27, 2021 at 15:24
  • This might be because of catastrophic backtracking. Try replacing r"(?=(\w+\W*){0,SPAN}WORD\b)" with r"(?=(?:\w+\W+){0,SPAN}WORD\b)" and r"\bALL(\b(ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}" with r"\bALL(?:\b(?:ANY)(?:\W+\w+){0,INTER}){COUNT,COUNT}", i.e. replace \W* with \W+ (but drop the trailing \W* in \W+\w+\W*) and make all groups non-capturing (replace ( with (?:). A sketch of this rewrite follows the comments. Commented Aug 27, 2021 at 15:26
  • @WiktorStribiżew When I try this on the same list of strings (a much smaller set than the PySpark DataFrame), I don't get the same results as before. When I try it on PySpark, I get the same behaviour: it appears to be working, but it's unclear when, or whether, it will ever finish. Commented Aug 27, 2021 at 15:51
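
For concreteness, here is a sketch of the pattern builder with the changes the last comment suggests applied (non-capturing groups, and \W+ in place of \W*); the word set and maxInter are the same as in the question, and this has not been verified against the full dataset:

import re

words    = {"other", "this", "that"}
maxInter = 3

wordSpan = len(words)+maxInter*(len(words)-1)

anyWord  = "|".join(words)
# \W+ instead of \W*, and non-capturing groups, per the comment above
allWords = "".join(r"(?=(?:\w+\W+){0,SPAN}WORD\b)".replace("WORD",w)
                    for w in words)
allWords = allWords.replace("SPAN",str(wordSpan-1))

pattern = r"\bALL(?:\b(?:ANY)(?:\W+\w+){0,INTER}){COUNT,COUNT}"
pattern = pattern.replace("COUNT",str(len(words)))
pattern = pattern.replace("INTER",str(maxInter))
pattern = pattern.replace("ALL",allWords)
pattern = pattern.replace("ANY",anyWord)

print(pattern)

The point of the rewrite is to drop the adjacent optional quantifiers (\w+\W* followed by \W+\w+\W*) that give the engine many equivalent ways to split the same text, which is what makes the original pattern prone to catastrophic backtracking.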
