I want to use regular expressions to remove an HTML opening tag, its closing tag, and all the content between them. How can I remove the <head> element from the following string?

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

So that it looks like this:

my_string = '''
<html>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''
  • You need an HTML parser. Commented Apr 12, 2019 at 16:58
  • What have you tried so far? Why regex and not a proper HTML parser? Commented Apr 12, 2019 at 16:58
  • I tried re.sub() with the pattern r'(<head>)(.+?)(</head>)' to match the opening head tag, any content in between, and the closing head tag. Commented Apr 12, 2019 at 17:11
  • This input isn't really HTML. Commented Apr 14, 2019 at 18:27

1 Answer

You can remove the head tag from HTML text with Beautiful Soup in Python, using the decompose() method. Try this Python code:

from bs4 import BeautifulSoup

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

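soup = BeautifulSoup(my_string, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
soup.find('head').decompose()  # find the head tag and remove it, with its contents, from the tree
print(soup)                    # prints the HTML text without the head tag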

This prints:

<html>

<meta/>
<p>
        this is a different paragraph tag
        </p>
</html>

Although regex is not recommended for this, if the tag you want to remove is not nested, you can remove it with the regex you mentioned in your comment, as in the following Python code. In general, avoid regex for parsing nested structures and use a proper parser instead.

import re

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

print(re.sub(r'(?s)<head>.*?</head>', '', my_string))

This prints the following. Note the (?s) flag, which makes the dot match newlines; it is needed because your HTML spans multiple lines:

<html>

    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
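
The pattern above only matches a literal lowercase <head> with no attributes. As a minimal sketch, assuming you still want the regex route, the variant below also tolerates attributes and mixed case. The names HEAD_RE, strip_head, and sample are hypothetical, made up for this illustration. It still cannot handle nested tags of the same name or attribute values that contain '>'.

```python
import re

# Hypothetical helper, not from the question: a slightly more tolerant pattern.
#   \b      stops "<head" from also matching "<header>"
#   [^>]*   allows attributes inside the opening tag
#   DOTALL  lets "." match newlines (same effect as the inline (?s) flag)
HEAD_RE = re.compile(r'<head\b[^>]*>.*?</head\s*>', re.DOTALL | re.IGNORECASE)

def strip_head(html_text):
    """Return html_text with every non-nested <head>...</head> block removed."""
    return HEAD_RE.sub('', html_text)

# Illustrative input with an uppercase tag, an attribute, and newlines inside.
sample = '<html><HEAD lang="en">\n<title>t</title>\n</HEAD><body>hi</body></html>'
cleaned = strip_head(sample)
print(cleaned)
```

The non-greedy .*? matters here: with a greedy .* and more than one head block in the input, everything between the first opening tag and the last closing tag would be removed.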

2 Comments

Thank you, your solution worked. I was not aware of the decompose method. This gets straight to exactly what I needed. The regex works as well, but the decompose method is more suitable.
The problem with using an HTML parser is that it tries to treat the input as if it were HTML. As you can see in the output of the BeautifulSoup code, it puts the <p> after the <meta> rather than inside it, because that is how HTML works. But the OP's input is not HTML.
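
To illustrate the point in the last comment: the string in the question happens to be well-formed XML, so an XML parser keeps the <p> nested inside <meta>. The following is only a sketch under that assumption, using the standard-library xml.etree.ElementTree; the function name remove_elements is hypothetical. It will raise a parse error on real-world HTML that is not well-formed XML.

```python
import xml.etree.ElementTree as ET

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

def remove_elements(xml_text, tag):
    """Parse xml_text as XML and drop every element named tag, with its contents."""
    root = ET.fromstring(xml_text.strip())
    # Snapshot the tree first so removing children does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                # Note: this also drops any text that directly follows the element.
                parent.remove(child)
    return ET.tostring(root, encoding='unicode')

result = remove_elements(my_string, 'head')
print(result)
```

Unlike the HTML parser output, the <p> here stays inside <meta>, because an XML parser has no notion of <meta> being a void element.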
