In an HTML file, I have the following occurrences:

<span class="finereader"></span>

or

<span class="finereader">a</span>

I'd like to remove all of these tags. As the second example shows, the tag may contain a single character (a letter or a digit, but never more than one). That character should be kept; only <span class="finereader"> and the closing </span> should be removed. Is there a re.sub expression that can do this? Thanks for any help.

2 Answers

Another solution using BeautifulSoup:

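from bs4 import BeautifulSoup

# Name the parser explicitly and close the file when done.
with open('htmlfile') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Replace each finereader span with its text, or with nothing if it is empty.
for elem in soup.find_all('span', class_='finereader'):
    elem.replace_with(elem.string or '')

print(soup.prettify())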
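
Since the question asks specifically about re.sub, here is a minimal sketch of that route as well. It assumes the tag always appears exactly as written in the question (same attribute, same quoting, no extra whitespace) and holds at most one ASCII letter or digit; the names FINEREADER_SPAN and strip_finereader are made up for this example. Regular expressions are brittle on general HTML, so treat it as a fallback rather than a replacement for a parser.

```python
import re

# Matches <span class="finereader">, at most one ASCII letter or digit, then </span>.
# Assumes the exact spelling from the question; other attribute orders or
# extra whitespace will not match.
FINEREADER_SPAN = re.compile(r'<span class="finereader">([A-Za-z0-9]?)</span>')


def strip_finereader(html: str) -> str:
    """Remove finereader spans but keep the single character inside, if any."""
    return FINEREADER_SPAN.sub(r'\1', html)


sample = 'x<span class="finereader"></span>y<span class="finereader">a</span>z'
print(strip_finereader(sample))  # xyaz
```

Spans that hold more than one character are left untouched by this pattern, which matches the constraint stated in the question.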

4 Comments

Isn't it possible using strings or lxml? Because I worked with lxml...and if I understood correctly, BS is just an alternative to lxml, isn't it?
@MarkF6: BeautifulSoup can use several parsers, one of them lxml. Take a look at crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
Thanks a lot. This worked. But I've got one last problem: the layout that BS produces (with all the indentation) isn't helpful to me; in fact, I'd like to have no indentation at all. Is there a way to achieve this using BS?
@MarkF6: Use print(soup) instead of print(soup.prettify()).

You might want to look at BeautifulSoup instead of using regular expressions for this task.

Then you can do something like this (a string stands in for the HTML file in this example):

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>Sample</title>
</head>
<body>
<span class="dummy">a</span>
<span>b</span>
</body>
</html>
"""
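# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html_doc, 'html.parser')
# Print the text inside every span.
for span in soup.find_all('span'):
    print(span.string)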

# output:
# a
# b
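
Note that the loop above only prints the text of each span; it does not remove anything from the document. One way to drop the tags while keeping their contents is Tag.unwrap(), sketched below. The sample markup and the class filter are taken from the question, not from a real file, and the built-in 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; a real file would be read from disk.
html_doc = '<p>x<span class="finereader"></span>y<span class="finereader">a</span>z</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# unwrap() removes the tag itself and leaves its children in place.
for span in soup.find_all('span', class_='finereader'):
    span.unwrap()

# str(soup) serialises without the extra indentation that prettify() adds.
result = str(soup)
print(result)
```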
