1

I am trying to use a regex to extract title-cased words and phrases that occur within sentences.

Effort so far:

(?:[A-Z][a-z]+\s?)+  

When applied to the sample sentence below, this regex matches This, Sample Sentence, Real Value, Whether, and Not. But I need to ignore words like This and Whether (sentence starters).

Sample Sentence:

This is a Sample Sentence to check the Real Value of this code. Whether it works or Not depends upon the result.
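
For reference, here is a minimal sketch of that attempt run against the sample sentence with Python's built-in re module (the variable names are only illustrative). Note that each match also carries the trailing whitespace consumed by `\s?`.

```python
import re

# The original attempt: one or more capitalized words, each optionally
# followed by a single whitespace character.
ATTEMPT = re.compile(r'(?:[A-Z][a-z]+\s?)+')

sample = ('This is a Sample Sentence to check the Real Value of this code. '
          'Whether it works or Not depends upon the result.')

attempt_matches = ATTEMPT.findall(sample)
print(attempt_matches)
# ['This ', 'Sample Sentence ', 'Real Value ', 'Whether ', 'Not ']
```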

Expectation:

Only Sample Sentence, Real Value, and Not should be matched; the sentence starters This and Whether should be ignored.

Useful code:

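# Third-party `regex` package, imported under the name `re`;
# it accepts the variable-width lookbehind used below.
import regex as re

text = 'This is a Sample Sentence to check the Real Value of this code. Whether it works or Not depends upon the result. A State Of The Art Technology is needed to do this work.'
# Skip candidates at the start of the string or right after ., ! or ? plus a space.
rex = r'(?<!^|[.!?]\ )\b[A-Z][a-z]+(?:\ [A-Z][a-z]+)*\b'

matches = re.finditer(rex, text)
results = [match[0] for match in matches]
print(results)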

Result:

['Sample Sentence', 'Real Value', 'Not', 'State Of The Art Technology']

2 Answers

3

Assuming your regex flavor supports Lookbehinds, I would use something like this:

(?<!^|\.\ )\b[A-Z][a-z]+(?:\ [A-Z][a-z]+)*\b

Demo.

This will match words that are preceded by an abbreviation, punctuation, or pretty much anything other than the start of the string or a period followed by a space (the end of the previous sentence).


Edit:

As per Nick's suggestion in the comments, it's probably better to include ! and ? in the Lookbehind to support sentences ending with either of them, not just the period:

(?<!^|[.!?]\ )\b[A-Z][a-z]+(?:\ [A-Z][a-z]+)*\b

Demo.
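
One caveat for Python users: the built-in re module only accepts fixed-width lookbehinds, so the alternation `^|[.!?]\ ` in the pattern above needs either the third-party regex package or a rewrite. A possible rewrite, sketched here on the assumption that splitting the alternation into two separate negative lookbehinds is acceptable, behaves the same way because "not preceded by A or B" is the same as "not preceded by A and not preceded by B". The function name extract_title_phrases is hypothetical.

```python
import re

# Two separate negative lookbehinds instead of one alternation:
#   (?<!^)        not at the very start of the string
#   (?<![.!?] )   not right after sentence-ending punctuation plus a space
TITLE_PHRASE = re.compile(
    r'(?<!^)(?<![.!?] )\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b'
)

def extract_title_phrases(text):
    """Return title-cased words/phrases that do not start a sentence."""
    return TITLE_PHRASE.findall(text)

print(extract_title_phrases(
    'This is a Sample Sentence to check the Real Value of this code. '
    'Whether it works or Not depends upon the result.'
))
# ['Sample Sentence', 'Real Value', 'Not']

# A sentence ending in "?" is also treated as a sentence boundary.
print(extract_title_phrases('Is this a Good Idea? Maybe it is Not.'))
# ['Good Idea', 'Not']
```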

4 Comments

You should probably allow for sentences to end in ? or ! as well.
@Nick It can indeed end with anything. It just won't be included in the match (which is what I believe the OP intended). I think that the OP's pattern including the trailing whitespace was just a side-effect of them trying to include multiple consecutive words in one match.
I was referring to your negative lookbehind, it would be better with [.!?] than \.
@Programmer_nltk you might find this solution more flexible if your sentence structure can include punctuation in the middle of them.
2

If your sentences are always single-spaced, you can use a positive lookbehind for a lowercase letter or comma followed by a space to find the start of a title-cased expression:

(?<=[a-z,] )(?:[A-Z][a-z]+(?![a-z]).)+

This regex allows for the expression to end in punctuation instead of just a space (e.g. the Final Result.).

Demo on regex101
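
As a rough illustration, this pattern also runs under Python's built-in re module, since its lookbehind is fixed-width (two characters). The sketch below shows that each match includes the single character consumed by the trailing `.` (a space or a punctuation mark), so a caller may want to strip it afterwards. The variable names and the clean-up step are illustrative assumptions, not part of the pattern itself.

```python
import re

# Lookbehind: a lowercase letter or comma, then a space.
# Each capitalized word must be followed by one more character (the `.`),
# which becomes part of the match.
PATTERN = re.compile(r'(?<=[a-z,] )(?:[A-Z][a-z]+(?![a-z]).)+')

sample = ('This is a Sample Sentence to check the Real Value of this code. '
          'Whether it works or Not depends upon the result.')

raw = PATTERN.findall(sample)
print(raw)
# ['Sample Sentence ', 'Real Value ', 'Not ']

# A phrase that ends the sentence keeps its closing punctuation.
final = PATTERN.findall('This is the Final Result.')
print(final)
# ['Final Result.']

# Optional clean-up (an assumption about what the caller wants):
# drop the trailing space or punctuation mark from each match.
cleaned = [m.rstrip(' .,!?;:') for m in raw]
print(cleaned)
# ['Sample Sentence', 'Real Value', 'Not']
```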

2 Comments

This will not match things like Bb in AA Bb or Aa, Bb. Although not included in OP's example, I think they would want to match those as well (I could be wrong though).
@AhmedAbdelhameed I'm not sure about AA Bb but Aa, Bb might well be valid. It's a good point.
