1

I would like to capture all [[A-Za-z].]+ in my string, that is, all repeats of an alphabetic character followed by a dot.

So for example, in "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."

I would like to pull out "A.B.C." and "U.V.W.X." only (as they are repeats of one character followed by a dot).

It almost seems that I need a recursive regex to do this: [[A-Za-z].]+.

Is it possible to implement this with either Python's re module or the regex module?

1
  • I don't think recursion is the right word here. Repetition would be a better way to describe this idea accurately. Commented Feb 14, 2017 at 1:23

4 Answers

1

You can use a non-capturing group to define your match, then group its repeats nested between boundary characters (in this case anything that's not a letter or a dot) and capture all matched groups:

import re

MATCH_GROUPS = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)

your_string = "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."

# get a list of matches
print(MATCH_GROUPS.findall(your_string))  # ['A.B.C.', 'U.V.W.X.']

A bit clunky but should get the job done with edge cases as well.
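One edge case worth checking: the boundary groups consume the character on each side of a match, so two candidates separated by a single space share that space and the second one can be skipped. A minimal sketch, assuming the same "not a letter or a dot" boundary rule, replaces the consuming groups with zero-width look-around assertions (CONSUMING and LOOKAROUND are names introduced here for illustration only):

```python
import re

# Pattern from the answer: the boundary characters are part of the match.
CONSUMING = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)

# Hypothetical variant: same boundary rule, but checked without consuming anything.
LOOKAROUND = re.compile(r"(?<![a-z.])((?:[a-z]\.)+)(?![a-z.])", re.IGNORECASE)

adjacent = "A.B.C. U.V.W."  # two candidates separated by one space

consuming_result = CONSUMING.findall(adjacent)
lookaround_result = LOOKAROUND.findall(adjacent)

print(consuming_result)   # ['A.B.C.']
print(lookaround_result)  # ['A.B.C.', 'U.V.W.']
```

On the string from the question both patterns return the same two matches, since every candidate there is separated from the next by a whole word.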

P.S. The above will match single occurrences as well (e.g. A. if it appears standalone). If you're looking for multiple repeats only, replace the + (one or more repeats) with a range of your choice (e.g. {2,} for two or more repeats).
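As a rough illustration of the {2,} substitution, the sketch below runs both quantifiers over a made-up sample sentence (the sample string and the names ONE_OR_MORE and TWO_OR_MORE are assumptions, not part of the original answer):

```python
import re

# Original quantifier: one or more letter-dot pairs.
ONE_OR_MORE = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)

# Stricter quantifier: at least two letter-dot pairs.
TWO_OR_MORE = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.){2,})(?:[^a-z.]|$)", re.IGNORECASE)

sample = "see A. and then A.B. here"

one_or_more_result = ONE_OR_MORE.findall(sample)
two_or_more_result = TWO_OR_MORE.findall(sample)

print(one_or_more_result)  # ['A.', 'A.B.']
print(two_or_more_result)  # ['A.B.']
```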

edit: A small change to match beginning/end of string boundaries as well.


1

This will work for you, using simple re.findall notation:

(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+

In the regex, I first check whether we are at the start of the string or whether there is a whitespace character just before the match, and then I check for a repeated letter followed by a period. I place the parts I do not want to capture into a non-capturing group (?:...).

You can see it working here: https://regex101.com/r/ZwW7c7/4

Python Code (that I wrote):

import re
regex = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"
string = 'D.E.F. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.'
print(re.findall(regex,string))

Output:

['D.E.F.', 'A.B.C.', 'U.V.W.X.']

6 Comments

This does not match leading A.B.C. out of A.B.C XYZ.
You are right, we need to check for \s or the beginning of the string. I will update!
You don't need to increase matchNum by 1 manually. enumerate accepts an optional start parameter: enumerate(matches, 1)
Code is auto-generated by regex101 :)
Code is now updated to work for "A.B.C. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."; the output is A.B.C. A.B.C. U.V.W.X.
1

Using positive look-around assertions:

>>> import re
>>> pattern = r'(?:(?<=\s)|^)(?:[A-Za-z]\.)+(?:(?=\s)|$)'
>>> re.findall(pattern, 'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'DEF A.B.C. UVWX U.V.W.X.Y')
['A.B.C.']

UPDATE: As @bobblebubble suggested, the regex can be simplified using \S (non-whitespace character) with negative look-around assertions:

pattern = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'
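A quick way to check the simplified pattern is to run it over the same three inputs used in the session above. This is only a sketch; the samples list and results name are introduced here for illustration:

```python
import re

# Simplified pattern: no non-whitespace character directly before or after the match.
pattern = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'

samples = [
    'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.',
    'A.B.C. UVWX U.V.W.X. XYZ XY.Z.',
    'DEF A.B.C. UVWX U.V.W.X.Y',
]

results = [re.findall(pattern, s) for s in samples]
for r in results:
    print(r)
# ['A.B.C.', 'U.V.W.X.']
# ['A.B.C.', 'U.V.W.X.']
# ['A.B.C.']
```

The negative assertions also succeed at the very start and end of the string, which is why the separate ^ and $ alternatives are no longer needed.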

1 Comment

@bobblebubble, Thank you for the comment. I will update the answer to include your regular expression.
0

This regex seems to do the job (it tests whether we are at the beginning of the string or just after a whitespace character):

\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+
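One thing to keep in mind when trying this pattern in Python: it contains capturing groups, so re.findall returns the group contents (the last repetition of each group) rather than the whole match. A minimal sketch, with the sample text and variable names chosen here for illustration, reads the full match from re.finditer instead:

```python
import re

pattern = r"\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+"
text = "A.B.C. UVWX U.V.W.X. XYZ XY.Z."

# findall returns one tuple per match, holding only the captured groups.
groups_only = re.findall(pattern, text)

# group(0) is the entire matched substring, regardless of capturing groups.
full_matches = [m.group(0) for m in re.finditer(pattern, text)]

print(groups_only)   # [('C.', ''), ('', 'X.')]
print(full_matches)  # ['A.B.C.', 'U.V.W.X.']
```

Turning the groups into non-capturing ones, (?:...), would be another way to get whole matches directly from re.findall.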

EDIT: Sorry Shawn, I didn't see your modified answer.
