0

I've encountered a problem while trying to parse a complicated string. The string is really long and full of patterns, but let's focus only on the part I need to extract.

A substring from the huge string is:

... [span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" ...

Now I want to extract the review body, i.e. the text between [/span] and [div. The pattern is: [span class=...]...[/span] desired text [div ...], and it repeats throughout the whole string.

How exactly do I take this specific text from the whole string and write it line after line?

2
  • Do you really want to parse this with regex? It looks like it's just HTML with the angle brackets changed into square brackets and the quotes escaped, and the same reasons that make regex bad for HTML will almost certainly make regex bad for this language. Commented May 6, 2015 at 23:31
  • Actually, from a comment, it sounds like what you have really is just HTML. Commented May 6, 2015 at 23:32

3 Answers

2

This pattern should fetch the string; just grab the Group 1 value:

r'\[span\b[^]]*class=[\\"\']*review-title\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Or a more generic one that does not check for class="review-title" on the span:

r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Sample code at IDEONE:

import re

# Generic pattern: any [span ...]...[/span], then capture the text up to the next [div
p = re.compile(r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b')
test_str = "[span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" "
# Group 1 holds the review text (Python 3 syntax: print() and a plain r'' literal)
print(p.search(test_str).group(1))

Output:

I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server.

EDIT: Since the [s and ]s are in fact <s and >s, here is an updated regex and code:

import re

p = re.compile(r'<span\b[^>]*>[^<]*</span>\s*([^<]*)<div\b')
test_str = "<span class=\"review-title\">Wont open</span> I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. <div class=\"review-link\" "
# finditer yields every match in the string, not just the first one
print([x.group(1) for x in p.finditer(test_str)])

A more specific regex to account for the class attribute:

p = re.compile(r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b')
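To collect every review and write one per line, as the question asks, the class-specific pattern can be combined with finditer. This is a minimal sketch assuming Python 3; the helper name `extract_reviews` and the two-review sample string are made up for illustration and do not come from the original page.

```python
import re

# Class-specific pattern from above: a review-title span, then the review
# text, captured up to the next <div.
REVIEW_RE = re.compile(
    r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b'
)


def extract_reviews(html):
    """Return the text that follows each review-title span, stripped."""
    return [m.group(1).strip() for m in REVIEW_RE.finditer(html)]


# Hypothetical sample containing two reviews.
sample = (
    '<span class="review-title">Wont open</span> The game wont open. '
    '<div class="review-link"></div>'
    '<span class="review-title">Great</span> Works fine for me. '
    '<div class="review-link"></div>'
)

reviews = extract_reviews(sample)
# One review per line.
print("\n".join(reviews))
```

The same caveat applies as for the patterns above: the capture group stops at the first `<`, so a review body that contains inline markup would be cut short.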

4 Comments

hey it works great but just a little thing im having trouble to solve, the original [, ] are <, >. couldnt write it in the post. can you rewrite the regex please?
It works perfectly, but when I run it on the whole text, it only gives me the first result. I need to take all of those texts and write them into a string or a list of strings.
Use finditer. I have updated the EDIT section and the links to the corresponding demo programs.
:-D Always glad to help people out. Time to go to bed for me. Happy programming!
1

From your comments ("im having trouble to solve, the original [, ] are <, >"), it's pretty clear that what you have is HTML.

Do not try to parse HTML with regex.

What you want here is an HTML parser. For example:

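from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(huge_string, 'html.parser')
# class is a reserved word in Python, so bs4 spells the filter class_
for span in soup.find_all('span', class_='review-title'):
    # The review body is the text node right after the title span
    text = span.next_sibling
    print(text)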

Even if what you have is HTML escaped in some way (backslash-escaped quotes, angle brackets turned into square brackets, etc.), you still don't want to parse it with regex. In that case, at most, you might want to use a regex as the preprocessor to turn it back into HTML to feed to an HTML parser.
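The preprocessing idea could be sketched roughly as follows. It assumes the escaping is exactly what the question shows (backslash-escaped double quotes, and square brackets in place of angle brackets around tags). To keep the sketch free of third-party dependencies it feeds the result to the standard library's `html.parser` rather than BeautifulSoup. The names `to_html` and `ReviewCollector`, and the sample string, are hypothetical.

```python
import re
from html.parser import HTMLParser

# Only brackets that look like a tag ([name ...] or [/name]) are converted,
# so stray brackets in review text are left alone.
TAG_RE = re.compile(r'\[(/?[A-Za-z][^\]]*)\]')


def to_html(escaped):
    """Undo the assumed escaping: backslash-quote -> quote, [tag] -> <tag>."""
    unquoted = escaped.replace('\\"', '"')
    return TAG_RE.sub(r'<\1>', unquoted)


class ReviewCollector(HTMLParser):
    """Collect the text that directly follows each review-title span."""

    def __init__(self):
        super().__init__()
        self.in_title = False     # inside <span class="review-title">
        self.after_title = False  # just passed its closing </span>
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        self.after_title = False  # any new tag ends the review text
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'review-title' in classes:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'span' and self.in_title:
            self.in_title = False
            self.after_title = True

    def handle_data(self, data):
        if self.after_title and data.strip():
            self.reviews.append(data.strip())


# Hypothetical sample in the escaped form shown in the question.
sample = r'[span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. [div class=\"review-link\"][/div]'

collector = ReviewCollector()
collector.feed(to_html(sample))
collector.close()
print("\n".join(collector.reviews))
```

Here the regex only restores the markup; the decision about which text belongs to a review is left to the parser.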

Comments

0

It seems that you need just this regex:

(?<=\[/span\])[\s\S]*?(?=\[div)
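For illustration, this pattern might be used with re.findall along these lines. This is a sketch assuming Python 3 and the bracketed form of the input; the two-review sample string is invented.

```python
import re

# Lookbehind/lookahead pattern from above: everything between "[/span]" and
# the next "[div", matched lazily.
BETWEEN_RE = re.compile(r'(?<=\[/span\])[\s\S]*?(?=\[div)')

# Hypothetical sample containing two reviews in the escaped form.
sample = (
    r'[span class=\"review-title\"]Wont open[/span] The game wont open. '
    r'[div class=\"review-link\"][/div]'
    r'[span class=\"review-title\"]Great[/span] Works fine for me. '
    r'[div class=\"review-link\"][/div]'
)

# findall returns the whole match here because the pattern has no groups.
reviews = [text.strip() for text in BETWEEN_RE.findall(sample)]
for review in reviews:
    print(review)
```

Note that this pattern matches after any [/span], not only after a review-title span, so it relies on the input containing no other span elements followed by a [div.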

Comments
