5

I tried to use regexes for finding max-length sequences formed from repeated doubled letters, like AABB in the string xAAABBBBy.

As described in the official documentation:

The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible.

When I use the quantifier {n,}, I get a full substring, but + returns only parts:

import re

print(re.findall("((AA|BB){3,})", "xAAABBBBy"))
# [('AABBBB', 'BB')]
print(re.findall("((AA|BB)+)", "xAAABBBBy"))
# [('AA', 'AA'), ('BBBB', 'BB')]

Why is {n,} more greedy than +?

4
  • 2
    because the leftmost match wins. The second pattern matches the two first "A", can't match the remaining A, and matches the Bs. The first pattern starts at the second A. Commented Mar 24, 2023 at 19:56
  • 3
    Besides (AA|BB){3,} is not really equivalent of (AA|BB)+. (AA|BB){1,} will behave same Commented Mar 24, 2023 at 19:58
  • 1
    Both patterns are greedy and matches as much as possible chars. But the second pattern can match 1 or more times but not 2 times. As the quantifier is 1 or more times, the pattern has a match. But the "rule" for the first pattern is 3 or more times, so the pattern can not match BB after the first AA in AAABB and moves to the second A and tries again and then it can match at least 3 times either AA or BB Commented Mar 24, 2023 at 20:08
  • @Anton Ganichev by the way, what was your expected output, the "max length sequence" in the string? Commented Mar 24, 2023 at 20:51

1 Answer 1

5

Both quantifiers {3,} and + are greedy in the same way.

First, let's simplify the output a little bit by changing the inner group into a non-capturing one:

import re

print(re.findall("((?:AA|BB){3,})", "xAAABBBBy"))
# ['AABBBB']
print(re.findall("((?:AA|BB)+)", "xAAABBBBy"))
# ['AA', 'BBBB']

The first pattern requires a repetition (with the total number of occurrences being at least 3 – let's call this multiplicity ≥3), so the only possible match starts at the second A:

x AA A B B B By

The second pattern requires only multiplicity ≥1. As the string is scanned left-to-right, the first (left-most) possible greedy match is formed by the first two As. For the remaining string ABBBBy, the first (left-most) possible greedy match is BBBB. After that, only y remains, which can't be matched.

xA AAB B B By

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.