Why is the regex quantifier {n,} more greedy than + (in Python)?

Question

I tried to use regexes for finding max-length sequences formed from repeated doubled letters, like AABB in the string xAAABBBBy.

As described in the official documentation:

The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible.

When I use the quantifier {n,}, I get a full substring, but + returns only parts:

import re

print(re.findall("((AA|BB){3,})", "xAAABBBBy"))
# [('AABBBB', 'BB')]
print(re.findall("((AA|BB)+)", "xAAABBBBy"))
# [('AA', 'AA'), ('BBBB', 'BB')]

Why is {n,} more greedy than +?

because the leftmost match wins. The second pattern matches the two first "A", can't match the remaining A, and matches the Bs. The first pattern starts at the second A. — Casimir et Hippolyte
– Casimir et Hippolyte, Commented Mar 24, 2023 at 19:56
Besides (AA|BB){3,} is not really equivalent of (AA|BB)+. (AA|BB){1,} will behave same — anubhava
– anubhava, Commented Mar 24, 2023 at 19:58
Both patterns are greedy and matches as much as possible chars. But the second pattern can match 1 or more times but not 2 times. As the quantifier is 1 or more times, the pattern has a match. But the "rule" for the first pattern is 3 or more times, so the pattern can not match BB after the first AA in AAABB and moves to the second A and tries again and then it can match at least 3 times either AA or BB — The fourth bird
– The fourth bird, Commented Mar 24, 2023 at 20:08
@Anton Ganichev by the way, what was your expected output, the "max length sequence" in the string? — cards
– cards, Commented Mar 24, 2023 at 20:51

Lover of Structure · Accepted Answer · 2023-05-28 10:03:21Z

5

Both quantifiers {3,} and + are greedy in the same way.

First, let's simplify the output a little bit by changing the inner group into a non-capturing one:

import re

print(re.findall("((?:AA|BB){3,})", "xAAABBBBy"))
# ['AABBBB']
print(re.findall("((?:AA|BB)+)", "xAAABBBBy"))
# ['AA', 'BBBB']

The first pattern requires a repetition (with the total number of occurrences being at least 3 – let's call this multiplicity ≥3), so the only possible match starts at the second A:

x A ⟨A A B B B B⟩ y

The second pattern requires only multiplicity ≥1. As the string is scanned left-to-right, the first (left-most) possible greedy match is formed by the first two As. For the remaining string ABBBBy, the first (left-most) possible greedy match is BBBB. After that, only y remains, which can't be matched.

x ⟨A A⟩ A ⟨B B B B⟩ y

edited May 28, 2023 at 10:03

answered Mar 24, 2023 at 20:44

Lover of Structure

1,8883 gold badges13 silver badges30 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Why is the regex quantifier {n,} more greedy than + (in Python)?

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related