Replace or Remove HTML Tag & Content Python Regex

Question

I want to remove an HTML element, meaning the opening tag, the closing tag, and all content between them, using regular expressions. How can I remove the <head> element from the following string?

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

So that it looks like this:

my_string = '''
<html>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

What have you tried so far? Why regex and not a proper HTML parser?
– esqew, Commented Apr 12, 2019 at 16:58
I tried re.sub() with the pattern r'(<head>)(.+?)(</head>)' to match the opening head tag, any content in between, and the closing head tag.
– ElectroMotiveHorse, Commented Apr 12, 2019 at 17:11

Pushpesh Kumar Rajwanshi · Accepted Answer · 2019-04-12 19:06:08Z

2

You can remove the head tag from the HTML text with Beautiful Soup by calling decompose() on it. Try this Python code:

from bs4 import BeautifulSoup

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

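soup = BeautifulSoup(my_string, 'html.parser')  # name the parser explicitly so the result does not depend on which parsers are installed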
soup.find('head').decompose()  # find head tag and decompose/destroy it from the html
print(soup)                    # prints html text without head tag

This prints:

<html>

<meta/>
<p>
        this is a different paragraph tag
        </p>
</html>

Although regex is not recommended for this, if the tag you want to remove isn't nested, you can remove it with the regex from your comment, as in the code below. Always avoid regex for parsing nested structures and use a proper parser instead.

import re

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

print(re.sub(r'(?s)<head>.*?</head>', '', my_string))

This prints the following. Note the (?s) flag, which makes the dot match newlines; it is needed because your HTML spans multiple lines:

<html>

    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
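
As a minimal sketch of the two points above (the dot needs (?s) or re.DOTALL to cross newlines, and a lazy match breaks on nested tags), the following uses re.subn to count replacements. The `nested` string is a hypothetical input made up for illustration; it is not from the question.

```python
import re

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

pattern = r'<head>.*?</head>'

# Without DOTALL, "." never matches a newline, so the multi-line
# <head>...</head> block is not found and nothing is replaced.
unchanged, count_plain = re.subn(pattern, '', my_string)

# flags=re.DOTALL is the flag form of the inline (?s) modifier.
cleaned, count_dotall = re.subn(pattern, '', my_string, flags=re.DOTALL)

print(count_plain, count_dotall)

# Hypothetical nested input: the lazy match ends at the first closing
# tag, so part of the outer element is left behind.
nested = '<div>outer <div>inner</div> tail</div>'
leftover = re.sub(r'<div>.*?</div>', '', nested, flags=re.DOTALL)
print(repr(leftover))
```

The leftover fragment in the nested case is why a parser is the safer choice whenever the same tag can appear inside itself.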

2 Comments

ElectroMotiveHorse Over a year ago

Thank you, your solution worked. I was not aware of the decompose method. This gets straight to exactly what I needed. The regex works as well, but the decompose method is more suitable.

Mr Lister Over a year ago

The problem with using an HTML parser is that it tries to treat the input as if it were HTML. As you can see in the output of the BeautifulSoup code, it puts the <p> after the <meta> rather than inside it, because that is how HTML works. But the OP's input is not HTML.
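
If the input is better treated as generic markup than as HTML, one possible alternative is an XML parser. This is a swapped-in technique, not something proposed in the answer, and it assumes the string is well-formed XML, which the sample happens to be. The sketch below uses the standard library's xml.etree.ElementTree and keeps the <p> nested inside <meta>.

```python
import xml.etree.ElementTree as ET

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

# Parse as XML; this raises ET.ParseError if the input is not well formed.
root = ET.fromstring(my_string.strip())

# findall() returns a list, so removing children while looping is safe.
# Only <head> elements that are direct children of the root are removed.
for head in root.findall('head'):
    root.remove(head)

result = ET.tostring(root, encoding='unicode')
print(result)
```

Unlike the HTML parser, this does not apply HTML rules such as treating <meta> as an empty element, but it will fail outright on real-world HTML that is not valid XML.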
