Multiple regex in Python findall

Question

Say I have a string: "She has an excellent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"

Note that this string has no punctuation. I mainly want to insert the missing periods: "She has an excellent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O" I can use a regex with findall to get a list of the relevant words. I tried something like the following, but it does not give the desired result. I would like computationally efficient code.

import re

text = "She has an excelllent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"

r = re.findall('([A-Z][a-z]+)|([a-zA-Z0-9]+)|([A-Z][a-z]+)', text)

Your regexp will match all words. How is that supposed to help you find where the periods belong?
– Barmar, Commented Jul 2, 2021 at 9:11
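
To illustrate the point of this comment, here is a minimal sketch of what findall returns for the pattern from the question. The shortened sample string is an assumption made for illustration. Because the pattern has three capture groups, findall returns one tuple per match, and the run-together word comes back as a single match with no hint of where the sentence boundary is.

```python
import re

# Pattern copied from the question; the shortened sample text is illustrative only.
pattern = r'([A-Z][a-z]+)|([a-zA-Z0-9]+)|([A-Z][a-z]+)'
sample = "She has topicsOnly problem"

# With several capture groups, findall yields one tuple of groups per match.
matches = re.findall(pattern, sample)
for groups in matches:
    print(groups)
# ('She', '', '')
# ('', 'has', '')
# ('', 'topicsOnly', '')
# ('', 'problem', '')
```

The third alternative never fires, since it is identical to the first one, and "topicsOnly" is swallowed whole by the second.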

Patrick Janser · Accepted Answer · 2021-07-02 10:41:53Z


I tried something like this with the PCRE engine: (\p{Ll}+)(\p{Lu}\p{Ll}*)

You can test it here: https://regex101.com/r/tqIcdS/1

The idea is to use the \p{L} family of Unicode property classes, which match letters (similar to \w, but letters only, without digits or the underscore), so that accented characters are handled too (e.g. "Le pain, je l'ai mangéEnsuite j'ai bu un verre de vin").

\p{Ll} matches a lowercase Unicode letter.
\p{Lu} matches an uppercase Unicode letter.
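
As a rough illustration of what these two classes mean, the standard-library unicodedata module reports the same general categories ("Ll" and "Lu"). The sketch below is not a regex solution; it only assumes that comparing categories character by character is an acceptable stand-in, and the helper names are hypothetical.

```python
import unicodedata

def is_ll(ch):
    # True for characters in general category Ll (lowercase letter).
    return unicodedata.category(ch) == "Ll"

def is_lu(ch):
    # True for characters in general category Lu (uppercase letter).
    return unicodedata.category(ch) == "Lu"

def add_periods(text):
    # Hypothetical helper: insert ". " between an Ll character and a following Lu character.
    out = []
    prev = ""
    for ch in text:
        if prev and is_ll(prev) and is_lu(ch):
            out.append(". ")
        out.append(ch)
        prev = ch
    return "".join(out)

print(unicodedata.category("\u00e9"), unicodedata.category("E"))  # Ll Lu
print(add_periods("topicsOnly problem"))  # topics. Only problem
```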

I also captured the characters before and after to match the whole word.

Unfortunately, Python's built-in re module doesn't support \p{...} property classes.

But thanks to Wiktor's comment below, you could use the regex library from PyPI: https://pypi.org/project/regex/
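
A minimal sketch of how the pattern above might be used with that library, assuming it has been installed with pip install regex. The add_periods helper is hypothetical, and the ASCII-only fallback to the built-in re module is added here only so the snippet still runs when the package is missing; it is not part of the original suggestion.

```python
import re

try:
    import regex  # third-party package from PyPI (pip install regex)
except ImportError:
    regex = None  # assumption: fall back to ASCII-only matching with re

def add_periods(text):
    # Hypothetical helper: put ". " between a lowercase run and the capitalized word glued to it.
    if regex is not None:
        return regex.sub(r'(\p{Ll}+)(\p{Lu}\p{Ll}*)', r'\1. \2', text)
    # Fallback does not handle accented letters.
    return re.sub(r'([a-z]+)([A-Z][a-z]*)', r'\1. \2', text)

print(add_periods("clarity in EnglishHer confidence"))
# clarity in English. Her confidence

# With regex installed, accented letters are handled as well:
# add_periods("je l'ai mangéEnsuite j'ai bu") gives "je l'ai mangé. Ensuite j'ai bu"
```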

edited Jul 2, 2021 at 10:41

answered Jul 2, 2021 at 9:36


1 Comment

Wiktor Stribiżew Over a year ago

Python's re does not support Unicode property classes, but the regex package from PyPI does.

Wiktor Stribiżew · Accepted Answer · 2021-07-02 10:48:55Z

You can use built-in Python re for both ASCII and fully Unicode-aware solutions:

import re, sys

# Character classes listing every uppercase / lowercase code point; build them once and reuse them.
pLu = '[{}]'.format("".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()))
pLl = '[{}]'.format("".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).islower()))

text = "She has an excelllent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"
# Insert ". " between a lowercase character and the uppercase character that directly follows it.
print( re.sub(fr'({pLl})({pLu})', r'\1. \2', text) ) # Unicode-aware
# => She has an excelllent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O
print( re.sub(r'([a-z])([A-Z])', r'\1. \2', text) ) # ASCII only
# => She has an excelllent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O

See the Python demo.

The main idea is to match and capture a lowercase letter followed by an uppercase letter, ([a-z])([A-Z]), and to replace the match with the Group 1 value, a period and a space, and then the Group 2 value. \1 and \2 are backreferences to these group values.
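
A small sketch of the capture-and-backreference idea, showing what each group holds for one boundary. The names BOUNDARY, add_periods and add_periods_lookaround are hypothetical. Compiling the pattern once is an assumption about how the code would be reused on many strings, and the lookaround variant is a different technique from the one described above, shown only for comparison.

```python
import re

# Compile once if the substitution runs over many strings.
BOUNDARY = re.compile(r'([a-z])([A-Z])')

def add_periods(text):
    # \1 is the captured lowercase letter, \2 the captured uppercase letter.
    return BOUNDARY.sub(r'\1. \2', text)

def add_periods_lookaround(text):
    # Alternative: zero-width lookarounds match the position between the two
    # letters, so nothing is consumed and no backreferences are needed.
    return re.sub(r'(?<=[a-z])(?=[A-Z])', '. ', text)

m = BOUNDARY.search("topicsOnly")
print(m.group(1), m.group(2))  # s O

print(add_periods("topicsOnly problem is clarity in EnglishHer confidence"))
# topics. Only problem is clarity in English. Her confidence
```

Both functions leave "RUSSian" and "H2O" untouched, because neither contains a lowercase letter directly followed by an uppercase one.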
