Recursive regex in python regex module?

Question

I would like to capture all [[A-Za-z].]+ in my string, that is, all repeats of an alphabetic character followed by a dot.

So for example, in "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."

I would like to pull out "A.B.C." and "U.V.W.X." only (as they are repeats of one character followed by a dot).

It seems almost that I need a recursive regex to do this [[A-Za-z].]+.

Is it possible to implement this with either python's re module or regex module?

I don't think recursion is the right word here. Repetition would be a better way to describe this idea accurately.
– Mad Physicist, Commented Feb 14, 2017 at 1:23

zwer · Accepted Answer · 2017-02-14 01:42:07Z

1

You can use a non-capturing group to define your match, then group its repeats nested between boundary characters (in this case anything that's not a letter or a dot) and capture all matched groups:

import re

MATCH_GROUPS = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)

your_string = "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."

# get a list of matches
print(MATCH_GROUPS.findall(your_string))  # ['A.B.C.', 'U.V.W.X.']

A bit clunky but should get the job done with edge cases as well.

P.S. The above will also match single occurrences (e.g. A. if it appears standalone). If you want multiple repeats only, replace the + (one or more repeats) with a range of your choice (e.g. {2,} for two or more repeats).
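To illustrate the quantifier swap described in the P.S., here is a minimal sketch. The helper name `find_initialisms` and its `min_repeats` parameter are made up for this example; the pattern itself is the one from the answer, with `+` written as the equivalent `{1,}` so the lower bound can be changed.

```python
import re

def find_initialisms(text, min_repeats=1):
    # Hypothetical helper: {1,} behaves like +, {2,} requires two or more repeats.
    quantifier = "{%d,}" % min_repeats
    pattern = re.compile(
        r"(?:[^a-z.]|^)((?:[a-z]\.)" + quantifier + r")(?:[^a-z.]|$)",
        re.IGNORECASE,
    )
    return pattern.findall(text)

sample = "A. and A.B. and A.B.C."
one_or_more = find_initialisms(sample)                 # same as using +
two_or_more = find_initialisms(sample, min_repeats=2)  # same as using {2,}

print(one_or_more)  # ['A.', 'A.B.', 'A.B.C.']
print(two_or_more)  # ['A.B.', 'A.B.C.']
```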

edit: A small change to match beginning/end of string boundaries as well.
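One behavior worth checking with this pattern: the boundary characters are part of the match, so a single space between two candidates is consumed by the first match and is no longer available as the leading boundary of the second. The sketch below compares it with a lookaround variant that does not consume the boundaries. The lookaround pattern is an assumption added for comparison, not part of the answer above.

```python
import re

# Pattern from the answer: boundary characters are consumed by the match.
consuming = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)

# Comparison variant (an assumption for illustration): zero-width boundaries.
lookaround = re.compile(r"(?<![a-z.])(?:[a-z]\.)+(?![a-z.])", re.IGNORECASE)

adjacent = "A.B. C.D."

consuming_result = consuming.findall(adjacent)
lookaround_result = lookaround.findall(adjacent)

print(consuming_result)   # ['A.B.']
print(lookaround_result)  # ['A.B.', 'C.D.']
```

On the string from the question both patterns return the same two matches, because the candidates there are separated by other words.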

edited Feb 14, 2017 at 1:42

answered Feb 14, 2017 at 1:32


Shawn Tabrizi · Accepted Answer · 2017-02-14 01:42:50Z

1

This will work for you, using simple re.findall notation:

(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+

In the regex, I first check whether we are at the start of the string or there is a space just before the match, and then I check for repeated letter+period pairs. I place the parts I do not want to capture into a non-capturing group (?:...).

You can see it working here: https://regex101.com/r/ZwW7c7/4

Python Code (that I wrote):

import re
regex = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"
string = 'D.E.F. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.'
print(re.findall(regex,string))

Output:

['D.E.F.', 'A.B.C.', 'U.V.W.X.']
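Note that this pattern only checks the left boundary. A small sketch of what that means in practice, using an input where a letter follows the last dot. The `both_sides` variant with a trailing lookahead is an assumption added here for comparison.

```python
import re

# Pattern from the answer: only the left boundary is checked.
left_only = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"

# Comparison variant (an assumption): also require whitespace or end of string after the match.
both_sides = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+(?=\s|$)"

text = "DEF A.B.C. UVWX U.V.W.X.Y"

left_only_result = re.findall(left_only, text)
both_sides_result = re.findall(both_sides, text)

print(left_only_result)   # ['A.B.C.', 'U.V.W.X.']
print(both_sides_result)  # ['A.B.C.']
```

Whether the prefix of U.V.W.X.Y should count as a match depends on the use case, so either behavior may be the desired one.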

edited Feb 14, 2017 at 1:42

answered Feb 14, 2017 at 1:19


6 Comments

falsetru Over a year ago

This does not match leading A.B.C. out of A.B.C XYZ.

Shawn Tabrizi Over a year ago

You are right, we need to check for \s or beginning of string, I will update!

falsetru Over a year ago

You don't need to increase matchNum by 1 manually. enumerate accepts an optional start parameter: enumerate(matches, 1)

Shawn Tabrizi Over a year ago

Code is auto-generated by regex101 :)

Shawn Tabrizi Over a year ago

code is now updated to work for A.B.C. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.. Output is A.B.C. A.B.C. U.V.W.X.
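The regex101-generated loop that the comments refer to is not shown in this thread, so the following is only a sketch of the enumerate(matches, 1) suggestion, assuming the matches come from re.finditer. The variable names are illustrative.

```python
import re

regex = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"
string = "D.E.F. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."

matches = re.finditer(regex, string)

# start=1 makes the counter begin at 1, so no manual "match_num + 1" is needed.
numbered = [(match_num, m.group()) for match_num, m in enumerate(matches, start=1)]

for match_num, value in numbered:
    print(f"Match {match_num}: {value}")
```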

falsetru · Accepted Answer · 2017-02-14 07:39:09Z

1

Using positive look-around assertions:

>>> import re
>>> pattern = r'(?:(?<=\s)|^)(?:[A-Za-z]\.)+(?:(?=\s)|$)'
>>> re.findall(pattern, 'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'DEF A.B.C. UVWX U.V.W.X.Y')
['A.B.C.']

UPDATE As @bobblebubble suggested, the regex could be simplified using \S (non-space character) with negative look-around assertions:

pattern = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'
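As a quick check, the two patterns can be run side by side on the three inputs from the session above. This sketch only shows that they agree on these inputs; it does not prove they are equivalent for every string.

```python
import re

verbose = r'(?:(?<=\s)|^)(?:[A-Za-z]\.)+(?:(?=\s)|$)'
compact = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'

samples = [
    'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.',
    'A.B.C. UVWX U.V.W.X. XYZ XY.Z.',
    'DEF A.B.C. UVWX U.V.W.X.Y',
]

# Map each sample to the (verbose, compact) result pair.
results = {s: (re.findall(verbose, s), re.findall(compact, s)) for s in samples}

for sample, (from_verbose, from_compact) in results.items():
    print(sample, from_verbose, from_compact, from_verbose == from_compact)
```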

edited Feb 14, 2017 at 7:39

answered Feb 14, 2017 at 1:36


1 Comment

falsetru Over a year ago

@bobblebubble, Thank you for the comment. I will update the answer to include your regular expression.

VdF · Accepted Answer · 2017-02-14 01:34:58Z

0

This regex seems to do the job (testing whether we are at the beginning of the string or right after a space):

\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+

EDIT: Sorry Shawn, I didn't see your modified answer.
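One caveat when using this pattern from Python: it contains capturing groups, and re.findall returns the group contents rather than the whole match when groups are present. A repeated group keeps only its last repetition. A minimal sketch of the difference, using re.finditer and group(0) to get the full matches:

```python
import re

pattern = r"\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+"
text = "A.B.C. UVWX U.V.W.X. XYZ XY.Z."

# findall returns one tuple per match, holding the last repetition of each group.
groups_only = re.findall(pattern, text)

# group(0) is the whole match, regardless of capturing groups.
full_matches = [m.group(0) for m in re.finditer(pattern, text)]

print(groups_only)   # [('C.', ''), ('', 'X.')]
print(full_matches)  # ['A.B.C.', 'U.V.W.X.']
```

Switching the groups to non-capturing ones, (?:...), would be another way to make re.findall return the full matches.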

answered Feb 14, 2017 at 1:34
