How to remove HTML comments using Regex in Python

Question

I want to remove HTML comments from a piece of HTML text. For example:

<h1>heading</h1> <!-- comment-with-hyphen --> some text <-- con --> more text <hello></hello> more text

should result in:

<h1>heading</h1> some text <-- con --> more text <hello></hello> more text

Using regular expressions on a limited, known set of HTML may be appropriate. However, be aware that there are countless cases where it will break, and it is generally not advised.
– grc, Commented Jan 29, 2015 at 6:38
Why the downvotes on the question? If you are working on a "known set of HTML", this is a legitimate question.
– Rushabh Mehta, Commented Jan 30, 2015 at 7:24
Consider using an HTML-specific library like Beautiful Soup, as the solutions to this other question suggest: stackoverflow.com/questions/23299557/…
– hectorcanto, Commented Apr 22, 2020 at 0:39

Wiktor Stribiżew · Accepted Answer · 2019-01-14 12:04:20Z

12

Don't overlook line breaks: a comment can span several lines, so pass re.DOTALL to let . match newline characters as well.

re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)
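As a quick check, here is a minimal sketch that applies the same pattern to the question's sample and to a comment that spans two lines. The helper name `strip_comments` and the `multiline` string are made up for illustration; they are not part of the original answer.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->"; DOTALL lets "." cross newlines.
COMMENT_RE = re.compile(r"<!--.*?-->", flags=re.DOTALL)


def strip_comments(html: str) -> str:
    """Return html with every <!-- ... --> span removed (hypothetical helper)."""
    return COMMENT_RE.sub("", html)


sample = (
    "<h1>heading</h1> <!-- comment-with-hyphen --> some text "
    "<-- con --> more text <hello></hello> more text"
)
print(strip_comments(sample))

# A comment containing a line break is removed as well.
multiline = "before<!-- line one\nline two -->after"
print(strip_comments(multiline))
```

Note that only the comment itself is removed: the spaces on either side of it stay, so the first result has two spaces after `</h1>` where the question's expected output shows one.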

edited Jan 14, 2019 at 12:04

Wiktor Stribiżew

answered Jan 29, 2015 at 6:41

John Hua

3 Comments

Ethan Over a year ago

Why shouldn't we remove the carriage returns as well?

Greg Lindahl Over a year ago

huazhihao's answer matches comments that have carriage returns within the comment. One of the other answers lacks flags=re.MULTILINE

fjsj Over a year ago

Actually, it should be re.DOTALL, not re.MULTILINE. It is re.DOTALL that makes . match \n.
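The difference between the two flags can be seen in a short sketch. The input string is a made-up example.

```python
import re

text = "a<!-- first line\nsecond line -->b"

# re.MULTILINE only changes where ^ and $ match; "." still stops at "\n",
# so the two-line comment is not matched and the text comes back unchanged.
with_multiline = re.sub(r"<!--.*?-->", "", text, flags=re.MULTILINE)

# re.DOTALL makes "." match "\n" too, so the whole comment is consumed.
with_dotall = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)

print(repr(with_multiline))
print(repr(with_dotall))
```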

Shawn · Accepted Answer · 2017-08-10 19:34:22Z

5

html = re.sub(r"<!--(.|\s|\n)*?-->", "", html)

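re.sub finds every match of the pattern and replaces it with the second argument. Here, the pattern matches anything that starts with <!-- and ends with -->. The . matches any character except a newline, \s and \n add whitespace and newlines so that multi-line comments match too, and *? repeats the group as few times as possible.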
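For comparison, a small sketch suggests that the alternation and the re.DOTALL form used in other answers give the same result on ordinary input. The input string is invented for illustration.

```python
import re

html = "x<!-- one\ntwo -->y<!-- three -->z"

# Alternation form from this answer: "." plus explicit whitespace and newline.
alternation = re.sub(r"<!--(.|\s|\n)*?-->", "", html)

# Equivalent form that lets "." match newlines directly.
dotall = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)

print(alternation)
print(dotall)
```

Because a space matches both `.` and `\s`, the alternation gives the regex engine overlapping choices, which may backtrack heavily on an unterminated comment; the re.DOTALL form avoids that overlap.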

edited Aug 10, 2017 at 19:34

answered Aug 10, 2017 at 16:44

Shawn

1 Comment

jpaugh Over a year ago

Welcome to Stack Overflow! If the OP could understand your code by itself, he probably would not be asking. Please explain what it does, so that it provides value for those who would need to look up a regex.

Rushabh Mehta · Accepted Answer · 2015-01-29 06:22:04Z

3

Finally came up with this option:

re.sub("(<!--.*?-->)", "", t)

Adding the ? makes the search non-greedy and does not combine multiple comment tags.
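The effect of the ? can be seen in a short sketch that compares a greedy and a non-greedy pattern on text with two comments. The input string is a made-up example.

```python
import re

t = "keep1 <!-- a --> keep2 <!-- b --> keep3"

# Greedy: ".*" runs to the LAST "-->", swallowing the text between the comments.
greedy = re.sub(r"<!--.*-->", "", t)

# Non-greedy: ".*?" stops at the FIRST "-->", so each comment is removed on its own.
lazy = re.sub(r"<!--.*?-->", "", t)

print(repr(greedy))
print(repr(lazy))
```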

answered Jan 29, 2015 at 6:22

Rushabh Mehta

Iskren · Accepted Answer · 2015-01-29 09:14:49Z

2

Don't use regex. Use an XML parser instead; the one in the standard library is more than sufficient.

from xml.etree import ElementTree as ET
# Requires well-formed XML; the default parser drops comments while parsing.
tree = ET.parse("comments.html")
ET.dump(tree)  # Dumps to stdout
tree.write("no-comments.html", method="html")  # Write to a file
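A minimal sketch of the same parser approach on an in-memory string, assuming the input is well-formed XML. Both snippets are invented for illustration. It also shows what appears to be the main limitation: markup that is valid HTML but not valid XML is rejected rather than repaired.

```python
from xml.etree import ElementTree as ET

# Hypothetical, well-formed snippet; real-world HTML is often not valid XML.
source = "<div><h1>heading</h1><!-- comment-with-hyphen --><p>some text</p></div>"

# The default parser does not keep comments in the resulting tree.
root = ET.fromstring(source)
cleaned = ET.tostring(root, encoding="unicode", method="html")
print(cleaned)

# An unclosed <br> is fine in HTML but is a mismatched tag in XML.
try:
    ET.fromstring("<p>unclosed<br></p>")
    rejected = False
except ET.ParseError as exc:
    rejected = True
    print("not well-formed:", exc)
```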

answered Jan 29, 2015 at 9:14

Iskren

1 Comment

Greg Lindahl Over a year ago

While this is good advice, XML parsers are much, much slower than this sort of regex.

Dmitry Mottl · Accepted Answer · 2018-08-11 11:05:27Z

1

re.sub("(?s)<!--.+?-->", "", s)

or

re.sub("<!--.+?-->", "", s, flags=re.DOTALL)

answered Aug 11, 2018 at 11:05

Dmitry Mottl

dragon2fly · Accepted Answer · 2015-01-29 06:36:26Z

0

You could try this regex <![^<]*>

answered Jan 29, 2015 at 6:36

dragon2fly

3 Comments

Greg Lindahl Over a year ago

Your regex matches too much -- note that the question has an example "<-- con -->", which is not an HTML comment.

dragon2fly Over a year ago

@GregLindahl this regex didn't match "<-- con -->" and returned the result as the OP expected.

k-den Over a year ago

This won't match a comment with an HTML tag inside of it, like
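k-den's point can be checked with a short sketch. All three inputs are made up for illustration; the last one shows a separate side effect, namely that the pattern also removes a doctype declaration.

```python
import re

pattern = r"<![^<]*>"

# Made-up inputs for illustration.
plain = "a <!-- note --> b"
nested = "a <!-- <b>bold</b> --> b"
doctype = "<!DOCTYPE html><p>hi</p>"

plain_out = re.sub(pattern, "", plain)      # comment removed
nested_out = re.sub(pattern, "", nested)    # comment left in place
doctype_out = re.sub(pattern, "", doctype)  # doctype removed as well

print(repr(plain_out))
print(repr(nested_out))
print(repr(doctype_out))
```

The nested case fails because `[^<]*` cannot cross the `<` of the inner tag, so the pattern never reaches a closing `>`.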
