7

I want to remove HTML tags from a given string using JavaScript. I looked into the current approaches, but each of them leaves a problem unsolved.

Current solutions

(1) Using JavaScript: create a detached div element and read its text

  function remove_tags(html)
  {
       var tmp = document.createElement("DIV");
       tmp.innerHTML = html; 
       return tmp.textContent||tmp.innerText; 
  }

(2) Using regex

  function remove_tags(html)
  {
       return html.replace(/<(?:.|\n)*?>/gm, '');
  }

(3) Using jQuery

  function remove_tags(html)
  {
       return jQuery(html).text();
  }

These three solutions work correctly for ordinary HTML, but if the string looks like this:

  <div> hello <hi all !> </div>

the stripped string is just hello. I only want to remove the HTML tags, so the result should be hello <hi all !>
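
To make the problem concrete, here is a small check using the regular expression from approach (2). The name removeTagsRegex is only for illustration; the DOM-based approaches (1) and (3) need a browser and are not shown.

```javascript
// Approach (2) from above, renamed for illustration.
function removeTagsRegex(html) {
  return html.replace(/<(?:.|\n)*?>/gm, '');
}

const sample = '<div> hello <hi all !> </div>';
const strippedSample = removeTagsRegex(sample);

// The pseudo-tag <hi all !> is removed along with the real <div> tags.
console.log(JSON.stringify(strippedSample.trim())); // "hello"
```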

Edit: The background is that I want to remove all HTML tags that a user enters in a particular text area, while still allowing users to enter text such as <hi all>. The current approaches remove anything enclosed in <>.

9
  • 4
    If you want special parsing rules for invalid HTML, you will need to write a parser. Note that the last jQuery version is no different to the first, and a regular expression will not do the job for anything other than trivial input. Commented Jun 18, 2013 at 8:58
  • 2
    Additionally to RobG's comment: Maybe it would help if you'd explain the background, so that we can suggest better solutions. Why are you using JavaScript for this? Where is the HTML coming from that is invalid? Commented Jun 18, 2013 at 9:07
  • @RobG: I disagree, in this particular case. I think I have a fairly robust solution below, I'd appreciate your input. Commented Jun 18, 2013 at 10:39
  • @chacka Regarding your edit: You shouldn't use JavaScript for this. JavaScript is easily circumvented, and removing dangerous HTML is important. Do it server-side, for example using a markup library, just as Stack Overflow does on this site. They will remove and/or escape any problematic HTML. Commented Jun 18, 2013 at 11:01
  • @RoToRa: Stack Overflow also has a live preview that is rendered using JavaScript. I agree, though, and common sense says to sanitize at the server before storing in the database or outputting to the page. Commented Jun 18, 2013 at 11:04

6 Answers 6

7

Using a regex might not be a problem if you consider a different approach. For instance, looking for all tags, and then checking to see if the tag name matches a list of defined, valid HTML tag names:

var protos = document.body.constructor === window.HTMLBodyElement,
    validHTMLTags  =/^(?:a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdi|bdo|bgsound|big|blink|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|data|datalist|dd|del|details|dfn|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6|head|header|hgroup|hr|html|i|iframe|img|input|ins|isindex|kbd|keygen|label|legend|li|link|listing|main|map|mark|marquee|menu|menuitem|meta|meter|nav|nobr|noframes|noscript|object|ol|optgroup|option|output|p|param|plaintext|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|spacer|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video|wbr|xmp)$/i;

function sanitize(txt) {
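    var // This regex normalises anything between quotes.
        // (?!\1) stops at the closing quote; a backreference inside a
        // character class such as [^\1] does not work in JavaScript.
        normaliseQuotes = /=(["'])(?=(?:(?!\1)[\s\S])*[<>])(?:(?!\1)[\s\S])*\1/g,
        normaliseFn = function ($0) { 
            return $0.replace(/</g, '&lt;').replace(/>/g, '&gt;'); 
        },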
        replaceInvalid = function ($0, tag, off, txt) {
            var 
                // Is it a valid tag?
                invalidTag = protos && 
                    document.createElement(tag) instanceof HTMLUnknownElement
                    || !validHTMLTags.test(tag),

                // Is the tag complete?
                isComplete = txt.slice(off+1).search(/^[^<]+>/) > -1;

            return invalidTag || !isComplete ? '&lt;' + tag : $0;
        };

    txt = txt.replace(normaliseQuotes, normaliseFn)
             .replace(/<(\w+)/g, replaceInvalid);

    var tmp = document.createElement("DIV");
    tmp.innerHTML = txt;

    // Fall back to innerText (not innerHTML) so old browsers also return plain text
    return "textContent" in tmp ? tmp.textContent : tmp.innerText;
}

Working Demo: http://jsfiddle.net/m9vZg/3/

This works because browsers parse '>' as text if it isn't part of a matching '<' opening tag. It doesn't suffer the same problems as trying to parse HTML tags using a regular expression, because you're only looking for the opening delimiter and the tag name, everything else is irrelevant.
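
The key step, turning the `<` of an unknown tag into &lt; so that the browser treats it as text, can be sketched without a DOM. This is a simplified illustration rather than the answer's exact code: the function name escapeUnknownTags and the short allowlist are assumptions, and the sketch skips the quoted-attribute and incomplete-tag handling shown above.

```javascript
// Hypothetical, deliberately short allowlist; a real one would be much longer.
const KNOWN_TAGS = new Set(['div', 'p', 'span', 'b', 'i', 'a']);

// Escape the "<" of any opening or closing tag whose name is not allowlisted.
function escapeUnknownTags(text) {
  return text.replace(/<(\/?)(\w+)/g, (match, slash, name) =>
    KNOWN_TAGS.has(name.toLowerCase()) ? match : '&lt;' + slash + name
  );
}

const escapedSample = escapeUnknownTags('<div> hello <hi all !> </div>');
console.log(escapedSample); // <div> hello &lt;hi all !> </div>
```

Assigning that string to innerHTML and reading textContent back should then leave the pseudo-tag visible as text while the real div tags disappear.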

It's also future-proof: the WebIDL specification tells vendors how to implement prototypes for HTML elements, so we try to create an HTML element from the current matching tag. If the element is an instance of HTMLUnknownElement, we know that it's not a valid HTML tag. The validHTMLTags regular expression defines a list of HTML tags for older browsers, such as IE 6 and 7, that do not implement these prototypes.
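
The HTMLUnknownElement check can be isolated into a small helper. This is a sketch under assumptions: the name isKnownTag is hypothetical, and the document object and the HTMLUnknownElement constructor are passed in as parameters (doc and UnknownElement) only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: true when the environment recognises tagName as a
// standard element, false when it yields an HTMLUnknownElement.
function isKnownTag(tagName, doc, UnknownElement) {
  try {
    return !(doc.createElement(tagName) instanceof UnknownElement);
  } catch (e) {
    // createElement throws for strings that are not valid element names,
    // for example a name that starts with a digit.
    return false;
  }
}

// In a browser this would be called as:
//   isKnownTag('div', document, window.HTMLUnknownElement)
```

The try/catch matters because a pattern like <(\w+) can capture names such as "3", which createElement rejects with an exception instead of returning an unknown element.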


7 Comments

Good idea! It would be simpler to use a negative lookahead instead of a replacement function. jsfiddle.net/m9vZg/2
@thg435: you're right, but I was writing it with a better detection method in mind, which I just edited in ;-) Newer browsers don't use the validHTMLTags regex now.
Close, but fails for input like "foo<div", the result is "foo". You should only accept valid markup.
foo<div and bar> => "foo". There's no getting around it, you have to build a proper validating parser (that would be incompatible with current and past HTML specifications). You're getting there bit by bit. :-) It might be simpler to find non–standard tags, replace < with &lt; and do the textContent/innerText thing.
The OP wants anything that isn't a valid tag to be displayed. I think that's a strange requirement, since an HTML parser will not show anything that it thinks is a tag, even an invalid one, though it will show the content. The simple solution is to not have invalid tags in the first place, but the requirement is to fix things on the client. Hence my suggestion to make invalid tags not tags at all but keep them looking like tags (the < to &lt; thing) and leave the rest to the HTML parser. That's my theory anyway. :-) I think you've gotten a lot closer than I expected.
5

If you want to keep invalid markup untouched, regular expressions are your best bet. Something like this might work:

 text = html.replace(/<\/?(span|div|img|p...)\b[^<>]*>/g, "")

Expand (span|div|img|p...) into a list of all tags (or only those you want to remove). NB: the list must be sorted by length, longer tags first!

This may give incorrect results in some edge cases (such as attributes containing < or > characters), but the only real alternative would be to write a complete HTML parser yourself. That would not be extremely complicated, but it might be overkill here. Let us know.
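
As a rough sketch of how the pattern could be built from a list of tag names: buildTagStripper and the four-tag list below are hypothetical, and the sketch assumes the names are plain alphanumerics that need no regex escaping.

```javascript
// Build a function that strips only the listed tags (opening and closing).
function buildTagStripper(tagNames) {
  // Longer names first, as recommended above.
  const alternation = [...tagNames]
    .sort((a, b) => b.length - a.length)
    .join('|');
  const pattern = new RegExp('<\\/?(?:' + alternation + ')\\b[^<>]*>', 'gi');
  return (html) => html.replace(pattern, '');
}

const stripKnown = buildTagStripper(['p', 'div', 'span', 'img']);
const keptSample = stripKnown('<div> hello <hi all !> </div>');
console.log(JSON.stringify(keptSample)); // " hello <hi all !> "
```

Because hi is not in the list, <hi all !> is left in place while the div tags are removed.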

1 Comment

Note that in HTML5, any character other than whitespace is valid in an ID, so I could have an ID of "foo>". What now?
1
var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");

2 Comments

This does not work for non-HTML tags. E.g., if the string is <div> hi <abc> </div>, this regex will remove <abc> too.
Try it on <div id="foo>bar">foo bar</div>.
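
The failure suggested in the comment above can be reproduced directly with the regular expression from this answer (the variable names are only for illustration):

```javascript
const naiveTagPattern = /(<([^>]+)>)/ig;
const trickyInput = '<div id="foo>bar">foo bar</div>';

// The match stops at the first ">", which sits inside the attribute value.
const naiveResult = trickyInput.replace(naiveTagPattern, '');
console.log(naiveResult); // bar">foo bar
```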
0

Here is my solution:

function removeTags(){
    var txt = document.getElementById('myString').value;
    var rex = /(<([^>]+)>)/ig;
    alert(txt.replace(rex , ""));

}


0

I use a regular expression to prevent HTML tags in my textarea.

Example

<form>
    <textarea class="box"></textarea>
    <button>Submit</button>
</form>
<script>
    $(".box").focusout( function(e) {
        var reg = /<(.|\n)*?>/g; 
        if (reg.test($('.box').val())) {
            alert('HTML tags are not allowed');
        }
        e.preventDefault();
    });
</script>


0
<script type="text/javascript">
function removeHTMLTags() {
    var str = "<html><p>I want to remove HTML tags</p></html>";
    alert(str.replace(/<[^>]+>/g, ''));
}
</script>

1 Comment

Also see this link for more detail: ourcodeworld.com/articles/read/376/…
