11

I want to remove HTML comments from an html text

<h1>heading</h1> <!-- comment-with-hyphen --> some text <-- con --> more text <hello></hello> more text

should result in:

<h1>heading</h1> some text <-- con --> more text <hello></hello> more text
4
  • Using regular expressions on a limited, known set of HTML may be appropriate. However, you should be aware that there are countless cases where it will break and it is generally not advised. Commented Jan 29, 2015 at 6:38
  • Related: stackoverflow.com/a/1732454/3001761 Commented Jan 29, 2015 at 7:57
  • Why the downvotes on the question? If you are working on a "known set of HTML" this was a legit question. Commented Jan 30, 2015 at 7:24
  • Consider using a HTML specific library like Beatiful Soup, like this other question-solutions suggests: stackoverflow.com/questions/23299557/… Commented Apr 22, 2020 at 0:39

6 Answers 6

12

You shouldn't ignore Carriage return.

re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)
Sign up to request clarification or add additional context in comments.

3 Comments

Why shouldn't we remove the carriage returns as well?
huazhihao's answer matches comments that have carriage returns within the comment. One of the other answers lacks flags=re.MULTILINE
actually should be re.DOTALL, not re.MULTILINE. It's re.DOTALL who matches \n on .
5
html = re.sub(r"<!--(.|\s|\n)*?-->", "", html)

re.sub basically find the matching instance and replace with the second arguments. For this case, <!--(.|\s|\n)*?--> matches anything start with <!-- and end with -->. The dot and ? means anything, and the \s and \n add the cases of muti line comment.

1 Comment

Welcome to Stack Overflow! If the OP could understand your code by itself, he probably would not be asking. Please explain what it does, so that it provides value for those who would need to look up a regex.
3

Finally came up with this option:

re.sub("(<!--.*?-->)", "", t)

Adding the ? makes the search non-greedy and does not combine multiple comment tags.

Comments

2

Don't use regex. Use an XML parser instead, the one in the standard library is more than sufficient.

from xml.etree import ElementTree as ET
html = ET.parse("comments.html")
ET.dump(html) # Dumps to stdout
ET.write("no-comments.html", method="html") # Write to a file

1 Comment

While this is good advice, the performance of XML parsers is much, much, much slower than this sort of regex.
1
re.sub("(?s)<!--.+?-->", "", s)

or

re.sub("<!--.+?-->", "", s, flags=re.DOTALL)

Comments

0

You could try this regex <![^<]*>

3 Comments

Your regex matches too much -- note that the question has an example "<-- con -->", which is not an HTML comment.
@GregLindahl this regex didn't match "<-- con -->" and returned the result as the OP expected.
This won't match a comment with an HTML tag inside of it, like <!-- <br/> -->

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.