3

Say I have a string : "She has an excellent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"

If observed properly, this string doesnt have any punctuation. I am primarily focusing on putting the periods. "She has an excellent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O" I can use a regex and findall to get a list of relevant words. I tried using something like this, but its not giving the desired result. I would like a computationally efficient code.

import re

text = "She has an excelllent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"

r = re.findall('([A-Z][a-z]+)|([a-zA-Z0-9]+)|([A-Z][a-z]+)', text)
3
  • re.sub(r'([a-z])([A-Z])', r'\1. \2', text) Commented Jul 2, 2021 at 8:58
  • 2
    Your first and last alternatives are the same, aren't they? Commented Jul 2, 2021 at 9:10
  • 2
    Your regexp will match all words. How is that supposed to help you find where the periods belong? Commented Jul 2, 2021 at 9:11

2 Answers 2

1

I tried something like that with the PCRE engine : (\p{Ll}+)(\p{Lu}\p{Ll}*)

You can test it here: https://regex101.com/r/tqIcdS/1

The idea is to use the \p{L} to find any word character (like \w) but with handling unicode chars that might have accents (ex: "Le pain, je l'ai mangéEnsuite j'ai bu un verre de vin").

  • \p{Ll} matches a lowercase unicode word character.

  • \p{Lu} matches an uppercase unicode word character.

I also captured the characters before and after to match the whole word.

Unfortunately, Python's default re library doesn't support it.

But thanks to Wiktor's comment below, you could use the PyPi regex library: https://pypi.org/project/regex/

Sign up to request clarification or add additional context in comments.

1 Comment

Python re does not support Unicode property classes, but Python PyPi regex does.
0

You can use built-in Python re for both ASCII and fully Unicode-aware solutions:

import re, sys

pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
pLl = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).islower()]))

text = "She has an excelllent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"
print( re.sub(fr'({pLl})({pLu})', r'\1. \2', text) ) # Unicode-aware
# => She has an excelllent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O
print( re.sub(fr'([a-z])([A-Z])', r'\1. \2', text) ) # ASCII only
# => She has an excelllent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O

See the Python demo.

The main idea is to match and capture a lowercase letter and then an uppercase letter (([a-z])([A-Z])) and replace with Group 1 value + . and space and then Group 2 value, where \1 and \2 are backreferences to these group values.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.