2 orders of magnitude performance change with seemingly small python regex change

Question

I have a slow python regular expression, if I simply remove the last line of the regex, the speed increases by two orders of magnitude! Here's my reproducing example:

import re
import timeit

mystr = "14923KMLI MLI2010010206581258   0.109 b                              M     M     M     0    09 60+              "

basere = r"""
(?P<wban>[0-9]{5})
(?P<faaid>[0-9A-Z]{4})\s
(?P<id3>[0-9A-Z]{3})
(?P<tstamp>[0-9]{16})\s+
\[?\s*((?P<vis1_coef>\-?\d+\.\d*)|(?P<vis1_coef_miss>M))\s*\]?\s*
\[?(?P<vis1_nd>[0-9A-Za-z\?\$/ ])\]?\s+
((?P<vis2_coef>\d+\.\d*)|(?P<vis2_coef_miss>[M ]))\s+(?P<vis2_nd>[A-Za-z\?\$ ])\s+
...............\s+
\[?\s*((?P<drct>\d+)|(?P<drct_miss>M))\s+
((?P<sknt>\d+)|(?P<sknt_miss>M))\s+
((?P<gust_drct>\d+)\+?|(?P<gust_drct_miss>M))\s*\]?\s+
"""
additional = r"""
\[?((?P<gust_sknt>\d+)R?L?F*\d*\+?|(?P<gust_sknt_miss>M))\s*\]?\s+
"""

P1_RE = re.compile(basere + additional, re.VERBOSE)
P2_RE = re.compile(basere, re.VERBOSE)


for myre in ["P1_RE", "P2_RE"]:
    statement = "%s.match('%s')" % (myre, mystr)
    res = timeit.timeit(statement, "from __main__ import %s" % (myre,), 
                        number=1000)
    print('%s took %.9f per iteration' % (myre, res / 1000.))

# result on my laptop, python 2.6 and 3.3 tested
# P1_RE took 0.001489143 per iteration
# P2_RE took 0.000019991 per iteration

So the only difference between P1_RE and P2_RE is the additional regex. Any ideas as to what I am doing wrong?

Can you reduce the scope of the regex and reproduce the behavior? — Daenyth
– Daenyth, Commented Aug 14, 2014 at 20:53
@Daenyth I am not positive what you are asking me to do. I have tried decreasing the size of the regex, but the major change in performance only happens with the difference illustrated. — daryl
– daryl, Commented Aug 14, 2014 at 21:00
Reduce the regex to /trivial/ vs /trivial<withAdditional>/ and check for the difference again. As written your code paste is indecipherable and you're not likely to get much help — Daenyth
– Daenyth, Commented Aug 14, 2014 at 21:02
Those regexes are extremely hard to parse without knowing in advance what they intend to match. — chepner
– chepner, Commented Aug 14, 2014 at 21:04
Given the large number of optional items present in the second half or so, there seems to be a lot of opportunity for backtracking in that regex. Adding more to the end could trigger exponential behavior in the engine (i.e., the string doesn't match, but there are many more ways for it to not match that must be checked before we can be certain.) — chepner
– chepner, Commented Aug 14, 2014 at 21:16

Jerry · Accepted Answer · 2014-08-14 21:24:51Z

It's because of backtracking; which is a prominent cause of inefficiency in regexes.

The last line that you removed requires matching at least 1 digit or M, followed by spaces, and many many many optional things along the way.

The second to last line:

((?P<gust_drct>\d+)\+?|(?P<gust_drct_miss>M))\s*\]?\s+

initially matches the last part, that is 60+ and the spaces that follow, leaving nothing for the last line. As such, the regex fails to match and backtracks one character at a time to try matching the last line. Thing is, there are a lot of things the regex will try to match and for each character that the regex backtracked, there are more and more attempts to match.

Before the regex matches, it has actually rejected quite a lot of characters that the previous lines of the regex had matched before.

A simple example would be the string (10 a's):

aaaaaaaaaa

And the regex:

a+a+a+a+a+a+a+a+a+a+

The first part of the regex a+ will match the whole string (because + is greedy), but then, it needs to match more a because the next part of the regex is a+. The engine then backtrack one character so that the first a+ matches aaaaaaaaa (9 a's) and the second matches the last a.

There is a third a+ next and since there are no more a to match, the regex will backtrack one character. Since the second a cannot backtrack, it will be the first a+ that does and matches aaaaaaaa (8 a's), then the second a+ will match the last aa. Now there's no more a, so the second a+ will backtrack to match the second to last a, and finally, the third a+ can match the last a.

See how much the first 3 a+ did? Now imagine that going through each and every possible match until everything matched?

Usually, if you don't want backtracking at all, you would use atomic groups and possessive quantifiers, but python doesn't support those, at least, not the re module (the regex module does, however).

You might want to revise your regex and try to see what it should really match, what is really optional, especially the possible spaces in your capture groups.

thank you. Your post has me going in the right direction. An initial stab was to remove any "[" "]" from the input string before the regex runs and remove all those optional "[?" "]?" from the regex. That sped things up a bunch already.

Collectives™ on Stack Overflow

2 orders of magnitude performance change with seemingly small python regex change

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related