2

I need to find a combination of 2 consecutive title case words.

This is my code so far,

import re

text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'

rex=r'[A-Z][a-z]+\s+[A-Z][a-z]+'

re.findall(rex,text)

This gives me,

['Moh Shai', 'This Is', 'Python Code', 'Needs Some']

However, I need all the combinations, including pairs that overlap. Something like,

['Moh Shai', 'This Is', 'Python Code', 'Needs Some','Some Expertise']

Can someone please help?

4
  • Does this help? Commented Apr 19, 2016 at 23:34
  • 2
    If you can install a third-party module, the easiest way is with the regex module, which supports an overlapped=True flag on findall(). Commented Apr 19, 2016 at 23:39
  • @kindall you are awesome. That works great! Can you please post an answer so I may accept? Commented Apr 19, 2016 at 23:41
  • Please see: stackoverflow.com/questions/5616822/… Commented Apr 19, 2016 at 23:49

3 Answers

4

You can use a regex lookahead in combination with the re.finditer function in order to get the desired outcome:

import re

text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
rex=r'(?=([A-Z][a-z]+\s+[A-Z][a-z]+))'

matches = re.finditer(rex,text)
results = [match.group(1) for match in matches]

Now results will contain the information you need:

>>> results
['Moh Shai', 'This Is', 'Python Code', 'Needs Some', 'Some Expertise']

edit: For what it's worth, you don't even really need the finditer function. You can replace those bottom two lines with your previous line re.findall(rex,text) for the same effect.
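As a minimal sketch of that edit: a lookahead consumes no characters, so the scan advances one position at a time and consecutive pairs may share a word. The `\b` word boundaries below are my own addition, not part of the pattern above; I am assuming you do not want a pair to start or end in the middle of a longer word.

```python
import re

text = 'Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'

# (?=...) is a zero-width lookahead: it matches without consuming text,
# so the next search can start inside the pair that was just found.
# The \b anchors are an assumption added for this sketch.
rex = r'(?=\b([A-Z][a-z]+\s+[A-Z][a-z]+)\b)'

# With exactly one capturing group, findall returns that group's text.
results = re.findall(rex, text)
print(results)
```

On the sample text this yields the same five pairs as the finditer version.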


2 Comments

This answer only identifies title-case runs of exactly 2 words; it would fail on "The United States Of America".
Yes, as requested in the question.
3

I came to this question by its title and was disappointed that the solution wasn't what I expected.

The accepted answer only works for titles of exactly 2 words.

This code returns all of the tokens that are in title case, without assuming anything about the number of words in the title:

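import collections
import re

def title_case_to_token(c):
    # Wrap a matched run in <...> and join its words with underscores.
    # s[0] is the leading delimiter; s[-2:] is the trailing delimiter plus the next character.
    totoken = lambda s: s[0] + "<" + s[1:-2].replace(" ", "_") + ">" + s[-2:]
    # Pad the text so runs at the very start or end have delimiters too, then strip the padding.
    tokenized = re.sub(r"([\s.,;]([A-Z][a-z]+[\s.,;])+[^A-Z])", lambda m: totoken(m.group(0)), " " + c + " x")[1:-2]
    tokens = collections.Counter(re.findall(r"<\w+>", tokenized))
    return (tokens, tokenized)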

For example

text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
tokens, tokenized = title_case_to_token(text)

Value of tokens

Counter({'<Hi>': 1, '<Moh_Shai>': 1, '<This_Is>': 1, '<Python_Code>': 1, '<Regex>': 1, '<Needs_Some_Expertise>': 1})

Note that Needs_Some_Expertise is also caught by this regex, even though it has 3 words.

Value of tokenized

<Hi> my name is <Moh_Shai> and <This_Is> a <Python_Code> with <Regex> and <Needs_Some_Expertise>


1

If you can install a third-party module, the easiest way is with the regex module, which supports an overlapped=True flag on findall().
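A rough sketch of what that looks like, using the original pattern unchanged. The `regex` module is a separate install (`pip install regex`); the fallback to a stdlib lookahead when it is missing is only there so the snippet runs either way, and the name `pairs` is just for illustration.

```python
import re

try:
    import regex  # third-party module, not part of the standard library
except ImportError:
    regex = None

text = 'Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
rex = r'[A-Z][a-z]+\s+[A-Z][a-z]+'

if regex is not None:
    # overlapped=True restarts the search one character after the start
    # of the previous match instead of after its end.
    pairs = regex.findall(rex, text, overlapped=True)
else:
    # Fallback for this sketch only: wrap the same pattern in a lookahead.
    pairs = re.findall(r'(?=(' + rex + r'))', text)

print(pairs)
```

The upside over the lookahead trick is that the pattern itself stays exactly as you wrote it.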
