Advanced string parsing in Python

Question

I've run into a problem while trying to parse a complicated string. The string is very long and full of patterns, but let's focus only on the part I need to extract.

A substring from the huge string is:

... [span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" ...

Now I want to extract the review body, that is, the text that follows the closing [/span]. The pattern is: it starts with [span class=...]title[/span], then the desired text, then [div ...], and this pattern repeats throughout the whole string.

How exactly do I take this specific text from the whole string and write it line after line?

Do you really want to parse this with regex? It looks like it's just HTML with the angle brackets changed into square brackets and the quotes escaped, and the same reasons that make regex bad for HTML will almost certainly make regex bad for this language.
– abarnert, Commented May 6, 2015 at 23:31
Actually, from a comment, it sounds like what you have really is just HTML.
– abarnert, Commented May 6, 2015 at 23:32

Wiktor Stribiżew · Accepted Answer · 2015-05-06 23:01:42Z

2

This pattern should fetch the string; just grab the Group 1 value:

r'\[span\b[^]]*class=[\\"\']*review-title\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Or a more generic one that does not check for class="review-title" on the span:

r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Sample code at IDEONE:

import re

# Generic pattern: group 1 is the text between [/span] and the next [div
p = re.compile(r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b')
test_str = "[span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" "
print(p.search(test_str).group(1))

Output:

I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server.
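As a rough illustration of the class-specific pattern, here is a minimal sketch. It assumes the raw data contains literal backslashes before the quotes (which is what the `[\\"\']*` part of the pattern tolerates); the `sample` string and the `other` class are made up for the example.

```python
import re

# Class-specific bracket pattern from above.
# [\\"\']* allows an optional literal backslash and/or quote before the class name.
pattern = re.compile(
    r'\[span\b[^]]*class=[\\"\']*review-title\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'
)

# Hypothetical sample. Raw strings keep the backslashes as literal characters.
sample = (
    r'[span class=\"other\"]Skip[/span] not a review [div class=\"x\"]'
    r'[span class=\"review-title\"]Wont open[/span] The game wont open. [div class=\"review-link\"]'
)

match = pattern.search(sample)
# Guard against "no match" instead of calling .group() on None.
review = match.group(1).strip() if match else None
print(review)
```

The first span is skipped because its class is not `review-title`; only the text after the second span is captured.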

EDIT: Since the [s and ]s are in fact <s and >s, here is an updated regex and code:

import re

# Same idea for real HTML: group 1 is the text between </span> and the next <div
p = re.compile(r'<span\b[^>]*>[^<]*</span>\s*([^<]*)<div\b')
test_str = "<span class=\"review-title\">Wont open</span> I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. <div class=\"review-link\" "
# finditer yields every match, not just the first one
print([m.group(1) for m in p.finditer(test_str)])

A more specific regex to account for the class attribute:

p = re.compile(r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b')
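To show how the class-specific regex behaves when the pattern repeats, here is a small sketch that collects every review with `finditer`. The two-review `sample` string is invented for the example; real input will of course be messier.

```python
import re

# Class-specific HTML pattern from above.
pattern = re.compile(
    r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b'
)

# Hypothetical input containing two reviews back to back.
sample = (
    '<span class="review-title">Wont open</span> First review text. <div class="review-link"></div>'
    '<span class="review-title">Great</span> Second review text. <div class="review-link"></div>'
)

# One list entry per review; strip() drops the trailing space before <div.
reviews = [m.group(1).strip() for m in pattern.finditer(sample)]

# Write the reviews out line after line.
for review in reviews:
    print(review)
```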

edited May 6, 2015 at 23:01

answered May 6, 2015 at 22:39

Wiktor Stribiżew


4 Comments

Eran Over a year ago

hey it works great but just a little thing im having trouble to solve, the original [, ] are <, >. couldnt write it in the post. can you rewrite the regex please?

Eran Over a year ago

Works perfectly, but when I run it on the whole text, it gives me only the first result. I need to take all those texts and write them into a string or a list of strings.

Wiktor Stribiżew Over a year ago

Use finditer, I have updated the EDIT section and the links to the corresponding demo programs.

Wiktor Stribiżew Over a year ago

:-D Always glad to help people out. Time to go to bed for me. Happy programming!

Community · Accepted Answer · 2017-05-23 12:28:34Z

1

From your comments ("im having trouble to solve, the original [, ] are <, >"), it's pretty clear that what you have is HTML.

Do not try to parse HTML with regex.

What you want here is an HTML parser. For example:

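from bs4 import BeautifulSoup

# huge_string is the HTML text from the question
soup = BeautifulSoup(huge_string, 'html.parser')
# class is a reserved word in Python, so BeautifulSoup uses class_
for span in soup.find_all('span', class_='review-title'):
    text = span.next_sibling  # the text node right after </span>
    print(text)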

Even if what you have is HTML escaped in some way (backslash-escaped quotes, angle brackets turned into square brackets, etc.), you still don't want to parse it with regex. In that case, at most, you might want to use a regex as a preprocessor to turn it back into HTML to feed to an HTML parser.
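The preprocessing idea could look roughly like the sketch below. It assumes the escaping is exactly what the question shows (a backslash before each quote, square brackets in place of angle brackets), and it swaps in the standard library's `html.parser` instead of BeautifulSoup so the example has no third-party dependency. `unescape_markup` and `ReviewCollector` are hypothetical helper names, and the exact class match on `review-title` is a simplification.

```python
import re
from html.parser import HTMLParser


def unescape_markup(text):
    """Turn [tag attr=\\"v\\"] back into <tag attr="v"> (assumed escaping scheme)."""
    text = text.replace('\\"', '"')
    # Only rewrite brackets that look like an opening or closing tag.
    return re.sub(r'\[(/?[A-Za-z][^\]]*)\]', r'<\1>', text)


class ReviewCollector(HTMLParser):
    """Collects the text that directly follows each <span class="review-title">."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_title = False
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        self._capturing = False  # any new tag ends the current review text
        self._in_title = tag == 'span' and ('class', 'review-title') in attrs

    def handle_endtag(self, tag):
        # The review body starts right after the title span closes.
        self._capturing = self._in_title and tag == 'span'
        self._in_title = False
        if self._capturing:
            self.reviews.append('')

    def handle_data(self, data):
        if self._capturing:
            self.reviews[-1] += data


# Hypothetical escaped input; raw strings keep the backslashes literal.
raw = (
    r'[span class=\"review-title\"]Wont open[/span] The game wont open. '
    r'[div class=\"review-link\"][/div]'
    r'[span class=\"review-title\"]Love it[/span] Works fine for me. '
    r'[div class=\"review-link\"][/div]'
)

collector = ReviewCollector()
collector.feed(unescape_markup(raw))
collector.close()
reviews = [r.strip() for r in collector.reviews]
print(reviews)
```

The regex here only restores the markup; finding the spans and the text after them is left to the parser.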

edited May 23, 2017 at 12:28

CommunityBot

answered May 6, 2015 at 23:35

abarnert


Andie2302 · Accepted Answer · 2015-05-06 22:47:36Z

0

It seems that you need just this regex:

(?<=\[/span\])[\s\S]*?(?=\[div)
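A minimal usage sketch for this lookaround pattern, assuming the bracket form of the data; the `sample` string is invented. Note that it matches after any `[/span]`, not only after a `review-title` span.

```python
import re

# The lookbehind anchors the match right after [/span]; the lookahead stops it before [div.
pattern = re.compile(r'(?<=\[/span\])[\s\S]*?(?=\[div)')

# Hypothetical input with two reviews.
sample = (
    '[span class="review-title"]Wont open[/span] The game wont open. [div class="review-link"] '
    '[span class="review-title"]Love it[/span] Works fine. [div class="review-link"]'
)

# With no capture groups, findall returns the full text of each match.
reviews = [text.strip() for text in pattern.findall(sample)]
print(reviews)
```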

edited May 6, 2015 at 22:47

answered May 6, 2015 at 22:42

Andie2302
