1

I have a string like:

"<p>
<style type=""text/css"">
P { margin-bottom: 0.08in; direction: ltr; widows: 2; orphans: 2; }A:link { color: rgb(0, 0, 255); }    </style>
</p>
<p style=""font-variant: normal; font-style: normal; font-weight: normal"">
    <font face=""Trebuchet MS, Arial, Verdana, sans-serif""><span   style=""font-size: 12px; background-color: rgb(238, 238, 238);"">blablabla. </span></font></p>
<p style=""font-variant: normal; font-style: normal; font-weight: normal"">
<font face=""Trebuchet MS, Arial, Verdana, sans-serif""><span style=""font-size: 12px; background-color: rgb(238, 238, 238);"">tjatjatja</span></font><span style=""font-family: 'Trebuchet MS', Arial, Verdana, sans-serif; font-size: 12px; background-color: rgb(238, 238, 238);"">tjetjetje</span><span style=""font-size: 12px; font-family: 'Trebuchet MS', Arial, Verdana, sans-serif; background-color: rgb(238, 238, 238);"">.</span></p>
<p style=""font-variant: normal; font-style: normal; font-weight: normal"">
<span style=""font-family: 'Trebuchet MS', Arial, Verdana, sans-serif; font-size: 12px; background-color: rgb(238, 238, 238);"">huehuehue</span></p>
"

I want to remove the first style tag and its content. I have a regex like:

([\s\S]*)<style type=""text\/css"">[\s\S]+<\/style>([\s\S]*)

which matches just the first style tag, but when I try to remove it in Python with:

re.sub(r'([\s\S]*)<style type=""text/css"">[\s\S]*</style>([\s\S]*)', r'\1\2', cell_text, flags=re.M)

it doesn't work. I think it's either to do with the groups or with the string being multiline. Any ideas?

3
  • You'll at least have to make the [\s\S]* non-greedy ([\s\S]*?), in case more style tags are possible. Commented Jun 21, 2016 at 12:05
  • And... I'm no python expert, but your regex has single quotes - why 2 " in it? I'm guessing the string has 2 because that's how you escape quotes in python, but that shouldn't be necessary in a single quoted string, or...? Commented Jun 21, 2016 at 12:09
  • Not sure why the example data contains quotes. To counter it I used single-quotes for the raw strings containing regex. Commented Jun 21, 2016 at 12:59
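To illustrate the greedy versus non-greedy point from the comments, here is a minimal sketch. The short html string is made up for the demonstration (it is not the asker's data), and the [^>]* in the opening tag is an added assumption so that any attributes are accepted.

```python
import re

# Hypothetical sample with two style blocks, to show where each pattern stops.
html = (
    '<p><style type="text/css">P { margin: 0; }</style></p>'
    '<p>keep me</p>'
    '<style type="text/css">A { color: red; }</style>'
    '<p>tail</p>'
)

# Greedy: [\s\S]* runs to the LAST </style>, swallowing everything in between.
greedy = re.sub(r'<style\b[^>]*>[\s\S]*</style>', '', html)

# Non-greedy: [\s\S]*? stops at the nearest </style>, so each block is removed separately.
lazy = re.sub(r'<style\b[^>]*>[\s\S]*?</style>', '', html)

# count=1 limits the substitution to the first style block only.
first_only = re.sub(r'<style\b[^>]*>[\s\S]*?</style>', '', html, count=1)

print(greedy)
print(lazy)
print(first_only)
```

With the greedy pattern the "keep me" paragraph disappears along with both style blocks; the non-greedy versions leave it in place.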

2 Answers

3

Use a parser instead:

from bs4 import BeautifulSoup

string = """
<p>
<style type=""text/css"">
P { margin-bottom: 0.08in; direction: ltr; widows: 2; orphans: 2; }A:link { color: rgb(0, 0, 255); }    </style>
</p>
<p style=""font-variant: normal; font-style: normal; font-weight: normal"">
    <font face=""Trebuchet MS, Arial, Verdana, sans-serif""><span   style=""font-size: 12px; background-color: rgb(238, 238, 238);"">blablabla. </span></font></p>
<p style=""font-variant: normal; font-style: normal; font-weight: normal"">
<font face=""Trebuchet MS, Arial, Verdana, sans-serif""><span style=""font-size: 12px; background-color: rgb(238, 238, 238);"">tjatjatja</span></font><span style=""font-family: 'Trebuchet MS', Arial, Verdana, sans-serif; font-size: 12px; background-color: rgb(238, 238, 238);"">tjetjetje</span><span style=""font-size: 12px; font-family: 'Trebuchet MS', Arial, Verdana, sans-serif; background-color: rgb(238, 238, 238);"">.</span></p>
<p style=""font-variant: normal; font-style: normal; font-weight: normal"">
<span style=""font-family: 'Trebuchet MS', Arial, Verdana, sans-serif; font-size: 12px; background-color: rgb(238, 238, 238);"">huehuehue</span></p>
"""

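# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(string, "html.parser")
# extract() detaches each <style> element, together with its contents, from the tree
for s in soup("style"):
    s.extract()
print(soup)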

3 Comments

I was thinking about doing this since BeautifulSoup was already imported. Your solution works fantastically! Thanks!
@tjarles: Glad to help :)
@Jan: I think the downvote is the same as this one I got. Some people do not like the design. Ignoring the fact that this actually helps people in specific situations.
0

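To remove the CSS with a regex, use the pattern below. The [^>]* part lets the opening tag carry attributes such as type="text/css"; a bare <style> would not match the tag in the question:

(?s)<style\b[^>]*>.*?</style>

To do the replacement in Python with the 're' library, do something like this:

import re

pattern = re.compile(r'(?s)<style\b[^>]*>.*?</style>')
# stringToReplace is the HTML text to clean; count=1 removes only the first match
result = pattern.sub('', stringToReplace, count=1)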

Here is a tutorial for using the 're' library in Python: http://www.tutorialspoint.com/python/python_reg_expressions.htm

1 Comment

The link is not using the string from my example. If you enter that string and the example regex, it matches the first tag. It's the substitution part I'm unsure about.
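On the substitution part, one way to check whether a pattern matched at all is re.subn, which returns the new string together with the number of replacements made. The sketch below assumes the real cell text contains ordinary double quotes and that the doubled quotes in the question are an escaping artifact; that is a guess, not something the thread confirms. The shortened cell_text is illustrative only. Note also that re.M only changes how ^ and $ behave, so it has no effect on a pattern built from [\s\S].

```python
import re

# Assumption: the actual text uses single double-quote characters around attribute values.
cell_text = (
    '<p>\n'
    '<style type="text/css">\n'
    'P { margin-bottom: 0.08in; }\n'
    '</style>\n'
    '</p>\n'
    '<p>blablabla.</p>\n'
)

# Pattern from the question, with the doubled quotes kept literally.
original = r'([\s\S]*)<style type=""text/css"">[\s\S]*</style>([\s\S]*)'

# Attribute-agnostic, non-greedy alternative; no capture groups are needed
# because only the matched tag itself is replaced.
tolerant = r'<style\b[^>]*>[\s\S]*?</style>'

# subn reports how many substitutions happened, which makes a silent non-match visible.
_, n_original = re.subn(original, r'\1\2', cell_text)
cleaned, n_tolerant = re.subn(tolerant, '', cell_text, count=1)

print(n_original, n_tolerant)
print(cleaned)
```

If n_original comes back as 0 on the real data as well, the literal "" in the pattern is the likely culprit rather than the groups or the multiline flag.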
