I am trying to split this text:

"Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot " \
"for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this " \
"isn't true... Well, with a probability of .9 it isn't."

into the following list of sentences:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

Code:

print re.findall('([A-Z]+[^.].*?[a-z.][.?!] )[^a-z]',text)

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
 "Adam Jones Jr. thinks he didn't. "]

That is good, but it misses some sentences. The trailing [^a-z] is not part of my group, yet the character it matches is still consumed. Is there a way to tell Python to continue searching from that character?

EDIT:

This was achieved with a positive lookahead, as suggested by @sputnick:

print re.findall('([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])',text)

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
 'Did he mind? ',
 "Adam Jones Jr. thinks he didn't. ",
 "In any case, this isn't true... "]

But we still need the last sentence. Any ideas?

3 Answers

2

Try this:

print re.findall('([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])',text)

This uses a positive lookahead, which checks what follows without consuming it. See http://www.regular-expressions.info/lookaround.html
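To illustrate the difference, here is a minimal sketch (written for Python 3, with the question's text rebuilt as a `text` variable; the names `consuming` and `lookahead` are only illustrative). With a plain `[^a-z]` after the group, the capital letter that starts the next sentence is consumed by the match, so the next search starts one character too late. With `(?=[^a-z])` the character is only checked, not consumed.

```python
import re

# The sample text from the question.
text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

# Original pattern: the trailing [^a-z] consumes the first letter of the
# next sentence, so that sentence can no longer start with [A-Z]+.
consuming = re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )[^a-z]', text)

# Lookahead version: (?=[^a-z]) only peeks at the next character,
# so the following search starts right at the next sentence.
lookahead = re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])', text)

print(len(consuming))   # 2
print(len(lookahead))   # 4
```

Neither version picks up the last sentence, because nothing follows its final period.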

3 Comments

Wow, regexes are awesome, this works perfectly. Thanks @sputnick. What does ?= actually mean?
This is the syntax for a positive lookahead; check the link added to my answer.
Nice tutorial at the link. Is there also a way to include the last sentence, i.e. to skip the check for a space and [^a-z] after the dot when it is at the end of the input? Something like word boundaries.
1
(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))

Try this. See the demo. Grab the captured groups with re.findall.

https://regex101.com/r/gQ3kS4/45
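A possible way to use this pattern from Python 3 is sketched below. The names `PATTERN` and `split_sentences` are made up for illustration, and the `strip()` call is an addition that is not part of the answer: every match after the first starts with the space that separated it from the previous sentence, so the sketch trims it.

```python
import re

# The pattern from this answer, unchanged. The lookbehind rejects a period
# preceded by an abbreviation such as "Mr" or "Jr", or by text like "i.e".
PATTERN = r'(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))'


def split_sentences(text):
    """Return the captured sentences with surrounding whitespace removed."""
    # findall returns the contents of the single capture group.
    return [match.strip() for match in re.findall(PATTERN, text)]


text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

for sentence in split_sentences(text):
    print(sentence)
```

Note that this pattern only treats `.` and `?` as sentence terminators, so a sentence ending in `!` would not be split off.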

0

Finally:

 print re.findall('[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])|.*.$',text)

The above works as needed and includes the last sentence. But I have no idea why |.*.$ worked; please help me understand.

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
 'Did he mind? ',
 "Adam Jones Jr. thinks he didn't. ",
 "In any case, this isn't true... ",
 "Well, with a probability of .9 it isn't."]
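As for why `|.*.$` helps: `|` is alternation, and at each position the engine tries the left branch first. For the last sentence the left branch fails, because there is no space after the final period, so the right branch gets its turn. The dots in `.*.$` are unescaped, so each one matches any character, and the branch simply grabs everything that is left up to the end of the string (it behaves like `.+$`). A small sketch for Python 3, with `main` and `fallback` as illustrative names:

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

main = r'[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])'   # left branch: sentence + space
fallback = r'.*.$'                              # right branch: whatever remains

first_four = re.findall(main, text)                   # left branch alone
combined = re.findall(main + '|' + fallback, text)    # both branches

print(len(first_four))   # 4
print(combined[-1])      # Well, with a probability of .9 it isn't.
```

Because the right branch is a catch-all, it would also accept leftover text that is not a well-formed sentence.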

1 Comment

There is no space after the last sentence, so accept either a space (followed by the lookahead) or the end of the string: re.findall('[A-Z]+[^.].*?[a-z.][.?!](?: (?=[^a-z])|$)', text)
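A quick check of the pattern from this comment, as a sketch for Python 3 (`text` is the sample from the question; `sentences` and `trimmed` are illustrative names). Unlike the catch-all branch, the last sentence still has to satisfy the same sentence pattern; only the requirement of a trailing space is relaxed to "space or end of string".

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

# After the closing punctuation, accept either a space followed by a
# non-lowercase character, or the end of the string.
sentences = re.findall(r'[A-Z]+[^.].*?[a-z.][.?!](?: (?=[^a-z])|$)', text)

# The first four matches keep their trailing space; strip it if unwanted.
trimmed = [s.rstrip() for s in sentences]

for s in trimmed:
    print(s)
```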
