45

I've got some ugly HTML generated from Word, from which I want to strip all HTML comments.

The HTML looks like this:

<!--[if gte mso 9]><xml> <o:OfficeDocumentSettings> <o:RelyOnVML/> <o:AllowPNG/> </o:OfficeDocumentSettings> </xml><![endif]--><!--[if gte mso 9]><xml> <w:WordDocument> <w:View>Normal</w:View> <w:Zoom>0</w:Zoom> <w:TrackMoves/> <w:TrackFormatting/> <w:HyphenationZone>21</w:HyphenationZone> <w:PunctuationKerning/> <w:ValidateAgainstSchemas/> <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid> <w:IgnoreMixedContent>false</w:IgnoreMixedContent> <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText> <w:DoNotPromoteQF/> <w:LidThemeOther>NO-BOK</w:LidThemeOther> <w:LidThemeAsian>X-NONE</w:LidThemeAsian> <w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript> <w:Compatibility> <w:BreakWrappedTables/> <w:SnapToGridInCell/> <w:WrapTextWithPunct/> <w:UseAsianBreakRules/> <w:DontGrowAutofit/> <w:SplitPgBreakAndParaMark/> <w:EnableOpenTypeKerning/> <w:DontFlipMirrorIndents/> <w:OverrideTableStyleHps/> </w:Compatibility> <m:mathPr> <m:mathFont m:val="Cambria Math"/> <m:brkBin m:val="before"/> <m:brkBinSub m:val="&#45;-"/> <m:smallFrac m:val="off"/> <m:dispDef/> <m:lMargin m:val="0"/> <m:rMargin m:val="0"/> <m:defJc m:val="centerGroup"/> <m:wrapIndent m:val="1440"/> <m:intLim m:val="subSup"/> <m:naryLim m:val="undOvr"/> </m:mathPr></w:WordDocument> </xml><![endif]-->

...and the regex I am using is this one:

html = html.replace(/<!--(.*?)-->/gm, "")

But there seems to be no match; the string is unchanged.

What am I missing?

7
  • 4
    Works for me. Check jsfiddle.net/aQ5qp Commented Apr 13, 2011 at 17:37
  • The whole string is a comment hence everything is replaced by "" Commented Apr 13, 2011 at 17:42
  • 1
    As @Cybernate says, the regex does work on that text, so what gives? All the responders are assuming there are newlines in the text, which would explain the problem, but I don't see any newlines. Commented Apr 13, 2011 at 21:42
  • possible duplicate of is it possible to remove an html comment from dom using jquery Commented Jul 4, 2014 at 10:13
  • 1
    stackoverflow.com/questions/1732348/… Commented Jul 24, 2017 at 17:12

7 Answers

92

The regex /<!--[\s\S]*?-->/g should work.

Be aware that it will also kill escaping text spans in CDATA blocks.

E.g.

<script><!-- notACommentHere() --></script>

and literal text in formatted code blocks

<xmp>I'm demoing HTML <!-- comments --></xmp>

<textarea><!-- Not a comment either --></textarea>
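
For reference, here is a minimal sketch of the regex from the top of this answer applied to a comment that spans several lines. The `stripComments` helper name and the sample markup are made up for illustration. The second call shows the caveat above: the text inside `<textarea>` is removed even though a browser would not treat it as a comment.

```javascript
// Hypothetical helper: removes everything that looks like <!-- ... -->.
// [\s\S] matches any character including newlines, unlike "." without the s flag.
function stripComments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, "");
}

const multiLine = "<p>before</p><!-- line one\nline two --><p>after</p>";
console.log(stripComments(multiLine)); // "<p>before</p><p>after</p>"

// Caveat: this is literal text in a browser, but the regex strips it anyway.
const textarea = "<textarea><!-- Not a comment either --></textarea>";
console.log(stripComments(textarea)); // "<textarea></textarea>"
```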

EDIT:

This also won't prevent new comments from being introduced as in

<!-<!-- A comment -->- not comment text -->

which after one round of that regexp would become

<!-- not comment text -->

If this is a problem, you can escape < that are not part of a comment or tag (complicated to get right) or you can loop and replace as above until the string settles down.
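
The "loop and replace until the string settles down" option could look roughly like this. It is only a sketch; `stripCommentsUntilStable` is a hypothetical name, and the same caveats about script, xmp and textarea content still apply.

```javascript
// Hypothetical helper: re-apply the replacement until nothing changes,
// so that comments assembled by an earlier pass are removed too.
function stripCommentsUntilStable(html) {
  const comment = /<!--[\s\S]*?-->/g;
  let previous;
  do {
    previous = html;
    html = html.replace(comment, "");
  } while (html !== previous);
  return html;
}

// First pass leaves "a<!-- not comment text -->b", second pass leaves "ab".
console.log(stripCommentsUntilStable("a<!-<!-- A comment -->- not comment text -->b")); // "ab"
```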


Here's a regex that will match comments, including pseudo-comments and unclosed comments, per the HTML5 spec. CDATA sections are strictly allowed only in foreign XML content. This suffers from the same caveats as above.

var COMMENT_PSEUDO_COMMENT_OR_LT_BANG = new RegExp(
    '<!---?>'  // A comment with no body
    + '|<!--[\\s\\S]*?(?:-->|$)'  // A comment, possibly unclosed
    + '|<!(?![dD][oO][cC][tT][yY][pP][eE]|\\[CDATA\\[)[^>]*>?'
    + '|<[?][^>]*>?',  // A pseudo-comment
    'g');
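
A possible usage example, as a sketch only. The pattern is restated here under the hypothetical name `commentLike` so that the snippet runs on its own; it is a variant of the regex above, not necessarily the author's exact expression. In this variant the alternatives are joined with `|` and an unclosed comment runs to the end of the input. The sample string is made up.

```javascript
// Variant of the pattern above (an assumption, restated so this snippet is self-contained).
const commentLike = new RegExp(
    '<!---?>'                                                   // <!--> or <!--->: comment with no body
    + '|<!--[\\s\\S]*?(?:-->|$)'                                // comment, possibly unclosed
    + '|<!(?![dD][oO][cC][tT][yY][pP][eE]|\\[CDATA\\[)[^>]*>?'  // <! pseudo-comment, not DOCTYPE/CDATA
    + '|<[?][^>]*>?',                                           // <? pseudo-comment
    'g');

const sample = '<!DOCTYPE html><p>a<!-- c --></p><?php x ?><!ELEMENT y>b<!-- unclosed';
const cleaned = sample.replace(commentLike, '');
console.log(cleaned); // "<!DOCTYPE html><p>a</p>b"
```

The DOCTYPE survives because of the negative lookahead; everything else that starts with `<!` or `<?` is dropped.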

8 Comments

How would you modify this to get the comments only, and remove the html ?
@guiomie, I don't understand your goal. Please explain in more detail?
@MikeSamuel Your last code snippet misses two backslash escapes around the CDATA.
Example usage please?
This is a nice solution thanks. I am trying to modify it to catch comments that are spread across multiple lines. Any ideas on that? Worth a new question?
10

This is based on Aurielle Perlmann's answer; it supports all cases (single-line, multi-line, un-terminated, and nested comments):

/(<!--.*?-->)|(<!--[\S\s]+?-->)|(<!--[\S\s]*?$)/g

https://regex101.com/r/az8Lu6/1
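
A quick way to try the pattern in JavaScript, as a sketch with made-up input strings. Without the `m` flag, `$` only matches at the end of the input, so the third alternative swallows an unterminated comment up to the end of the string.

```javascript
const anyComment = /(<!--.*?-->)|(<!--[\S\s]+?-->)|(<!--[\S\s]*?$)/g;

// Single-line, multi-line and unterminated comments in one string.
const mixed = "a<!-- one -->b<!--\ntwo\n-->c<!-- open";
console.log(mixed.replace(anyComment, "")); // "abc"

// With nested input the match ends at the first "-->", leaving a tail behind.
const nested = "x<!-- a <!-- b --> c -->y";
console.log(nested.replace(anyComment, "")); // "x c -->y"
```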


4

You should use the /s modifier

html = html.replace(/<!--.*?-->/sg, "")

Tested in perl:

use strict;
use warnings;

my $str = 'hello <!--[if gte mso 9]><xml> <o:OfficeDocumentSettings> <o:RelyOnVML/> <o:AllowPNG/> </o:OfficeDocumentSettings> </xml><![endif]--><!--[if gte mso 9]><xml> <w:WordDocument> <w:View>Normal</w:View> <w:Zoom>0</w:Zoom> <w:TrackMoves/> <w:TrackFormatting/> <w:HyphenationZone>21</w:HyphenationZone> <w:PunctuationKerning/> <w:ValidateAgainstSchemas/> <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid> <w:IgnoreMixedContent>false</w:IgnoreMixedContent> <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText> <w:DoNotPromoteQF/> <w:LidThemeOther>NO-BOK</w:LidThemeOther> <w:LidThemeAsian>X-NONE</w:LidThemeAsian> <w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript> <w:Compatibility> <w:BreakWrappedTables/> <w:SnapToGridInCell/> <w:WrapTextWithPunct/> <w:UseAsianBreakRules/> <w:DontGrowAutofit/> <w:SplitPgBreakAndParaMark/> <w:EnableOpenTypeKerning/> <w:DontFlipMirrorIndents/> <w:OverrideTableStyleHps/> </w:Compatibility> <m:mathPr> <m:mathFont m:val="Cambria Math"/> <m:brkBin m:val="before"/> <m:brkBinSub m:val="&#45;-"/> <m:smallFrac m:val="off"/> <m:dispDef/> <m:lMargin m:val="0"/> <m:rMargin m:val="0"/> <m:defJc m:val="centerGroup"/> <m:wrapIndent m:val="1440"/> <m:intLim m:val="subSup"/> <m:naryLim m:val="undOvr"/> </m:mathPr></w:WordDocument> </xml><![endif]-->world!';

$str =~ s/<!--.*?-->//sg;
print $str;

Output:
hello world!

1 Comment

JavaScript did not have an s modifier until ES2018. For older engines, use [\s\S] instead of the dot (.).
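
For JavaScript, the two options can be compared side by side. This is a small sketch with a made-up input string; the `s` (dotAll) flag was added to the language in ES2018, so the second form is a syntax error in older engines.

```javascript
const html = "hello <!-- first\nsecond -->world!";

// Works in every engine: [\s\S] matches any character, newlines included.
const portable = html.replace(/<!--[\s\S]*?-->/g, "");

// ES2018 and later: the s (dotAll) flag lets "." match newlines as well.
const dotAll = html.replace(/<!--.*?-->/gs, "");

console.log(portable); // "hello world!"
console.log(dotAll);   // "hello world!"
```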
4

This also works for multiline comments: (<!--.*?-->)|(<!--[\w\W\n\s]+?-->)


3

I recently needed to do this very thing (i.e. remove all comments from an HTML file). Some things that the other answers don't take into consideration:

  1. An HTML file can contain inline CSS and JS, and I wanted to strip the comments from those as well.
  2. Comment syntax inside a string or regex is totally valid and must be left alone. (My string/regex exclusion pattern is based on: https://stackoverflow.com/a/23667311/3799617)

TLDR: (I just want the regex that removes all the comments, plz)

/\\\/|\/\s*(?:\\\/|[^\/\*\n])+\/|\\"|"(?:\\"|[^"])*"|\\'|'(?:\\'|[^'])*'|\\`|`(?:\\`|[^`])*`|(\/\/[\s\S]*?$|(?:<!--|\/\s*\*)\s*[\s\S]*?\s*(?:-->|\*\s*\/))/gm

And here is a simple demo: https://www.regexr.com/5fjlu


I don't hate reading, show me the rest:

I also needed to do various other matching that took into account valid strings containing things that otherwise appear as valid targets. So I made a class to handle my variety of uses.

class StringAwareRegExp extends RegExp {
    static get [Symbol.species]() { return RegExp; }

    constructor(regex, flags){
        if(regex instanceof RegExp) regex = StringAwareRegExp.prototype.regExpToInnerRegexString(regex);

        regex = super(`${StringAwareRegExp.prototype.disqualifyStringsRegExp}(${regex})`, flags);

        return regex;
    }

    stringReplace(sourceString, replaceString = ''){
        return sourceString.replace(this, (match, group1) => { return group1 === undefined ? match : replaceString; });
    }
}

StringAwareRegExp.prototype.regExpToInnerRegexString = function(regExp){ return regExp.toString().replace(/^\/|\/[gimsuy]*$/g, ''); };
Object.defineProperty(StringAwareRegExp.prototype, 'disqualifyStringsRegExp', {
    get: function(){
        return StringAwareRegExp.prototype.regExpToInnerRegexString(/\\\/|\/\s*(?:\\\/|[^\/\*\n])+\/|\\"|"(?:\\"|[^"])*"|\\'|'(?:\\'|[^'])*'|\\`|`(?:\\`|[^`])*`|/);
    }
});
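
The core trick in `stringReplace` can be sketched on its own: alternatives that should be left alone are matched first without a capture group, and the real target sits in group 1. The pattern below is a deliberately simplified, hypothetical one (double-quoted strings and HTML comments only), not the full expression used by the class.

```javascript
// Simplified, hypothetical pattern: double-quoted strings first (no group),
// then HTML comments captured in group 1.
const stringOrComment = /"(?:\\"|[^"])*"|(<!--[\s\S]*?-->)/g;

function stripCommentsOutsideStrings(source) {
  // If group 1 is undefined, a string was matched: put it back unchanged.
  return source.replace(stringOrComment, (match, comment) =>
    comment === undefined ? match : "");
}

const input = 'var s = "<!-- keep me -->"; <!-- drop me -->done';
console.log(stripCommentsOutsideStrings(input)); // var s = "<!-- keep me -->"; done
```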

From this I created two more classes to home in on the two major types of matches I needed:

class CommentRegExp extends StringAwareRegExp {
    constructor(regex, flags){
        if(regex instanceof RegExp) regex = StringAwareRegExp.prototype.regExpToInnerRegexString(regex);

        return super(`\\/\\/${regex}$|(?:<!--|\\/\\s*\\*)\\s*${regex}\\s*(?:-->|\\*\\s*\\/)`, flags);
    }
}

class StatementRegExp extends StringAwareRegExp {
    constructor(regex, flags){
        if(regex instanceof RegExp) regex = StringAwareRegExp.prototype.regExpToInnerRegexString(regex);

        return super(`${regex}\\s*;?\\s*?`, flags);
    }
}

And finally (however useful it may be to whomever) the regex created from this:

const allCommentsRegex = new CommentRegExp(/[\s\S]*?/, 'gm');
const enableBabelRegex = new CommentRegExp(/enable-?_?\s?babel/, 'gmi');
const disableBabelRegex = new CommentRegExp(/disable-?_?\s?babel/, 'gmi');
const includeRegex = new CommentRegExp(/\s*(?:includes?|imports?|requires?)\s+(.+?)/, 'gm');
const importRegex = new StatementRegExp(/import\s+(?:(?:\w+|{(?:\s*\w\s*,?\s*)+})\s+from)?\s*['"`](.+?)['"`]/, 'gm');
const requireRegex = new StatementRegExp(/(?:var|let|const)\s+(?:(?:\w+|{(?:\s*\w\s*,?\s*)+}))\s*=\s*require\s*\(\s*['"`](.+?)['"`]\s*\)/, 'gm');
const atImportRegex = new StatementRegExp(/@import\s*['"`](.+?)['"`]/, 'gm');

And lastly, if anyone cares to see it in use, here's the project I used it in (my personal projects are always a WIP): https://github.com/fatlard1993/page-compiler


0
html = html.replace("(?s)<!--\\[if(.*?)\\[endif\\] *-->", "")

1 Comment

JavaScript didn't support the s modifier before ES2018, nor does it support inline-modifier syntax like (?s).
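
A JavaScript version of the same idea might look like the sketch below. Note that a string passed as the first argument to `replace()` is matched literally, not as a regex, so a regex literal is needed. The variable names and the sample input are made up.

```javascript
// [\s\S]*? stands in for "(?s)" plus ".*?"; the unused capture group is dropped.
const conditionalComment = /<!--\[if[\s\S]*?\[endif\] *-->/g;

const wordHtml = "a<!--[if gte mso 9]><xml>\n</xml><![endif]-->b<!-- normal -->";

// Only the Word conditional comment is removed; ordinary comments are kept.
console.log(wordHtml.replace(conditionalComment, "")); // "ab<!-- normal -->"
```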
0

const regex = /<!--(.*?)-->/gm;
const str = `You will be able to see this text. <!-- You will not be able to see this text. --> You can even comment out things in <!-- the middle of --> a sentence. <!-- Or you can comment out a large number of lines. --> <div class="example-class"> <!-- Another --> thing you can do is put comments after closing tags, to help you find where a particular element ends. <br> (This can be helpful if you have a lot of nested elements.) </div> <!-- /.example-class -->`;
const subst = ``;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

1 Comment

Please explain why you are using the m pattern modifier.
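
On the `m` question: in JavaScript the `m` flag only changes where `^` and `$` match, and the pattern above uses neither, so the flag should make no difference here. A small check with a made-up input:

```javascript
const text = "one <!-- a --> two\nthree <!-- b --> four";

// Same pattern with and without the m flag.
const withM = text.replace(/<!--(.*?)-->/gm, "");
const withoutM = text.replace(/<!--(.*?)-->/g, "");

console.log(withM === withoutM); // true
console.log(JSON.stringify(withoutM)); // "one  two\nthree  four"
```

Neither version removes a comment that itself contains a newline, because `.` does not match line terminators without the `s` flag.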
