How to remove only html tags in a string using javascript

Question

I want to remove HTML tags from a given string using JavaScript. I looked into the current approaches, but each of them has an unsolved problem.

Current solutions

(1) Using JavaScript: create a detached div element and read its text

  function remove_tags(html)
  {
       var tmp = document.createElement("DIV");
       tmp.innerHTML = html; 
       return tmp.textContent||tmp.innerText; 
  }

(2) Using regex

  function remove_tags(html)
  {
       return html.replace(/<(?:.|\n)*?>/gm, '');
  }

(3) Using jQuery

  function remove_tags(html)
  {
       return jQuery(html).text();
  }

These three solutions work correctly, but if the string is like this

  <div> hello <hi all !> </div>

the stripped string is just hello. I need to remove only the HTML tags, so the result should be hello <hi all !>

Edited: The background is that I want to remove all HTML tags from user input in a particular text area, but I still want to allow users to enter text such as <hi all>. The current approaches remove any content enclosed in <>.
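
To make the problem concrete, here is a minimal sketch that runs the regex from approach (2) on the sample string. The variable names are only for illustration.

```javascript
// Approach (2) from above, unchanged
function remove_tags(html) {
  return html.replace(/<(?:.|\n)*?>/gm, '');
}

var sample = '<div> hello <hi all !> </div>';
var stripped = remove_tags(sample);

// Every <...> run is removed, including the non-HTML <hi all !>
console.log(JSON.stringify(stripped.trim())); // "hello"
// What is wanted instead: "hello <hi all !>"
```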

If you want special parsing rules for invalid HTML, you will need to write a parser. Note that the last jQuery version is no different to the first, and a regular expression will not do the job for anything other than trivial input.
– RobG, Commented Jun 18, 2013 at 8:58
In addition to RobG's comment: Maybe it would help if you'd explain the background, so that we can suggest better solutions. Why are you using JavaScript for this? Where is the HTML coming from that is invalid?
– RoToRa, Commented Jun 18, 2013 at 9:07
@RobG: I disagree, in this particular case. I think I have a fairly robust solution below, I'd appreciate your input.
– Andy E, Commented Jun 18, 2013 at 10:39
@chacka Regarding your edit: You shouldn't use JavaScript for this. JavaScript is easily circumvented and removing dangerous HTML is important. Do it server-side, for example using a markup library just as Stack Overflow does on this site. They will remove and/or escape any problematic HTML.
– RoToRa, Commented Jun 18, 2013 at 11:01
@RoToRa: Stack Overflow also has a live preview that is rendered using JavaScript. I agree, though, and common sense says to sanitize at the server before storing in the database or outputting to the page.
– Andy E, Commented Jun 18, 2013 at 11:04

Andy E · Accepted Answer · 2013-06-18 14:57:42Z

7

Using a regex might not be a problem if you consider a different approach. For instance, looking for all tags, and then checking to see if the tag name matches a list of defined, valid HTML tag names:

var protos = document.body.constructor === window.HTMLBodyElement,
    validHTMLTags  =/^(?:a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdi|bdo|bgsound|big|blink|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|data|datalist|dd|del|details|dfn|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6|head|header|hgroup|hr|html|i|iframe|img|input|ins|isindex|kbd|keygen|label|legend|li|link|listing|main|map|mark|marquee|menu|menuitem|meta|meter|nav|nobr|noframes|noscript|object|ol|optgroup|option|output|p|param|plaintext|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|spacer|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video|wbr|xmp)$/i;

function sanitize(txt) {
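    var // This regex matches a quoted attribute value. A backreference has no
        // effect inside a character class, so (?!\1) is used to stop at the
        // closing quote instead of [^\1]
        normaliseQuotes = /=(["'])(?:(?!\1)[\s\S])*\1/g,
        // Escape any angle brackets found inside the quoted value
        normaliseFn = function ($0) { 
            return $0.replace(/</g, '&lt;').replace(/>/g, '&gt;'); 
        },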
        replaceInvalid = function ($0, tag, off, txt) {
            var 
                // Is it a valid tag?
                invalidTag = protos && 
                    document.createElement(tag) instanceof HTMLUnknownElement
                    || !validHTMLTags.test(tag),

                // Is the tag complete?
                isComplete = txt.slice(off+1).search(/^[^<]+>/) > -1;

            return invalidTag || !isComplete ? '&lt;' + tag : $0;
        };

    txt = txt.replace(normaliseQuotes, normaliseFn)
             .replace(/<(\w+)/g, replaceInvalid);

    var tmp = document.createElement("DIV");
    tmp.innerHTML = txt;

    // Fall back to innerText (not innerHTML) where textContent is unsupported
    return "textContent" in tmp ? tmp.textContent : tmp.innerText;
}

Working Demo: http://jsfiddle.net/m9vZg/3/

This works because browsers parse '>' as text if it isn't part of a matching '<' opening tag. It doesn't suffer the same problems as trying to parse HTML tags using a regular expression, because you're only looking for the opening delimiter and the tag name, everything else is irrelevant.

It's also future-proof: the WebIDL specification tells vendors how to implement prototypes for HTML elements, so we try to create an HTML element from the current matching tag. If the element is an instance of HTMLUnknownElement, we know that it's not a valid HTML tag. The validHTMLTags regular expression defines a list of HTML tags for older browsers, such as IE 6 and 7, that do not implement these prototypes.
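
As a rough, DOM-free sketch of the first step only (not the full function above): escaping the `<` of any unrecognised tag name turns it into plain text for the parser, while known tags are left in place to be stripped later. The short `knownTags` list and the `escapeUnknownTags` name are assumptions for illustration, and the quote normalising and incomplete-tag checks are omitted.

```javascript
// Hypothetical, deliberately short list; the answer's validHTMLTags is much longer
var knownTags = /^(?:a|b|div|em|i|p|span|strong)$/i;

function escapeUnknownTags(txt) {
  // Only the opening delimiter and the tag name are inspected
  return txt.replace(/<(\w+)/g, function (match, tag) {
    return knownTags.test(tag) ? match : '&lt;' + tag;
  });
}

console.log(escapeUnknownTags('<div> hello <hi all !> </div>'));
// <div> hello &lt;hi all !> </div>
```

Assigning that intermediate string to innerHTML and reading textContent back should then keep `<hi all !>` as visible text, because the parser decodes `&lt;` and treats the unmatched `>` as an ordinary character.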

edited Jun 18, 2013 at 14:57

answered Jun 18, 2013 at 10:01


7 Comments

georg Over a year ago

Good idea! It would be simpler to use a negative lookahead instead of a replacement function. jsfiddle.net/m9vZg/2

Andy E Over a year ago

@thg435: you're right, but I was writing it with a better detection method in mind, which I just edited in ;-) Newer browsers don't use the validHTMLTags regex now.

RobG Over a year ago

Close, but fails for input like "foo<div", the result is "foo". You should only accept valid markup.

RobG Over a year ago

foo<div and bar> => "foo". There's no getting around it, you have to build a proper validating parser (that would be incompatible with current and past HTML specifications). You're getting there bit by bit. :-) It might be simpler to find non-standard tags, replace < with &lt; and do the textContent/innerText thing.

RobG Over a year ago

The OP wants anything that isn't a valid tag displayed. I think it's a strange requirement, since an HTML parser will not show anything that it thinks is a tag, even an invalid one, but it will show the content. The simple solution is to not have invalid tags in the first place, but the requirement is to fix things on the client. Hence my suggestion to make invalid tags not tags at all but keep them looking like tags (the < to &lt; thing) and leave it up to the HTML parser. That's my theory anyway. :-) I think you've gotten a lot closer than I expected.


georg · Accepted Answer · 2013-06-18 09:09:45Z

5

If you want to keep invalid markup untouched, regular expressions are your best bet. Something like this might work:

 text = html.replace(/<\/?(span|div|img|p...)\b[^<>]*>/g, "")

Expand (span|div|img|p...) into a list of all tags (or only those you want to remove). NB: the list must be sorted by length, longer tags first!

This may give incorrect results in some edge cases (such as attributes containing < or > characters), but the only real alternative would be to write a complete HTML parser yourself. Not that it would be extremely complicated, but it might be overkill here. Let us know.
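
A variation on the same idea, as a sketch rather than a drop-in replacement: match anything shaped like a tag, capture its name, and decide in a callback whether to remove it. The `tagsToRemove` set and the `stripListedTags` name are made up for illustration. Because the whole name is captured before the lookup, the order of the list does not matter in this version; the caveat about < and > inside attribute values still applies.

```javascript
// Hypothetical short list of tags to remove; extend as needed
var tagsToRemove = new Set(['span', 'div', 'img', 'p']);

function stripListedTags(html) {
  // Match anything shaped like an opening or closing tag, then decide by name
  return html.replace(/<\/?([a-z][a-z0-9]*)\b[^<>]*>/gi, function (match, name) {
    return tagsToRemove.has(name.toLowerCase()) ? '' : match;
  });
}

console.log(JSON.stringify(stripListedTags('<div> hello <hi all !> </div>')));
// " hello <hi all !> "
```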

edited Jun 18, 2013 at 9:09

answered Jun 18, 2013 at 8:53


1 Comment

RobG Over a year ago

Note that in HTML5, any character other than whitespace is valid in an ID, so I could have an ID of "foo>". What now?

Prashobh · Accepted Answer · 2013-06-18 08:53:51Z

1

var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");

answered Jun 18, 2013 at 8:53


2 Comments

cp100 Over a year ago

This does not work for non-HTML tags. E.g. if the string is <div> hi <abc> </div>, then this regex will remove <abc> too.

RobG Over a year ago

Try it on <div id="foo>bar">foo bar</div>.
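
Running that input through the regex from this answer shows the issue: the > inside the attribute value ends the first match early, so part of the attribute leaks into the output. The variable names are illustrative.

```javascript
var input = '<div id="foo>bar">foo bar</div>';
var output = input.replace(/(<([^>]+)>)/ig, '');

console.log(output); // bar">foo bar
```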

Human Being · Accepted Answer · 2014-01-24 13:06:21Z

0

Here is my solution:

function removeTags(){
    var txt = document.getElementById('myString').value;
    var rex = /(<([^>]+)>)/ig;
    alert(txt.replace(rex , ""));

}

answered Jan 24, 2014 at 13:06


Purvik Dhorajiya · Accepted Answer · 2017-08-21 11:49:26Z

0

I use a regular expression to detect HTML tags in my textarea and warn the user.

Example

<form>
    <textarea class="box"></textarea>
    <button>Submit</button>
</form>
<script>
    $(".box").focusout( function(e) {
        var reg =/<(.|\n)*?>/g; 
        if (reg.test($('.box').val()) == true) {
            alert('HTML Tag are not allowed');
        }
        e.preventDefault();
    });
</script>

answered Aug 21, 2017 at 11:49


Ngawang Zepa · Accepted Answer · 2017-11-02 12:05:18Z

0

<script type="text/javascript">
function removeHTMLTags() {
    var str = "<html><p>I want to remove HTML tags</p></html>";
    alert(str.replace(/<[^>]+>/g, ''));
}
</script>

answered Nov 2, 2017 at 12:05


1 Comment

Ngawang Zepa Over a year ago

Also refer to this link for more detail: ourcodeworld.com/articles/read/376/…
