5

i am trying to find a way using javascript or jquery to write a function which remove all the html tags from a page and just give me the plain text of this page.

How this can be done? any ideas?

2
  • Do you want a string that returns the text content of <body>, then? Commented Jun 3, 2010 at 14:23
  • 1
    This looks useful: stackoverflow.com/questions/822452/… Commented Jun 3, 2010 at 14:24

6 Answers 6

9

IE & WebKit

document.body.innerText

Others:

document.body.textContent

(as suggested by Amr ElGarhy)

Most js frameworks implement a crossbrowser way to do this. This is usually implemented somewhat like this:

text = document.body.textContent || document.body.innerText;

It seems that WebKit keeps some formatting with textContent whereas strips everything with innerText.

Sign up to request clarification or add additional context in comments.

5 Comments

I think that only works in internet explorer
It works just fine in my WebKit.
Indeed, only Firefox seems to give problems. But in Opera it still gives me HTML tags when printing innerText
use document.body.textContent in other browsers
your answer is complete and covered everything i wanted, thanks
3

It depends on how much formatting you want to keep. But with jQuery you can do it like this:

jQuery(document.body).text();

Comments

2

The only trouble with textContent or innerText is that they can jam the text from adjacent nodes together, without any white space between them.

If that matters, you can curse through the body or other container and return the text in an array, and join them with spaces or newlines.

document.deepText= function(hoo){
    var A= [], tem, tx;
    if(hoo){
        hoo= hoo.firstChild;
        while(hoo!= null){
            if(hoo.nodeType== 3){
                tx= hoo.data || '';
                if(/\S/.test(tx)) A[A.length]= tx;
            }
            else A= A.concat(document.deepText(hoo));
            hoo= hoo.nextSibling;
        }
    }
    return A;
}
alert(document.deepText(document.body).join(' '))
// return document.deepText(document.body).join('\n')

1 Comment

It might be a good idea to add nodeType of 4 as well (CDATA) just in case someone wraps their text in it. (This is how jQuery does it at least.)
1

I had to convert rich text in an HTML email to plain text. The following worked for me in IE (obj is a jQuery object):

function getTextFromHTML(obj) {
    var ni = document.createNodeIterator(obj[0], NodeFilter.SHOW_TEXT, null, false);
    var nodeLine = ni.nextNode();   // go to first node of our NodeIterator
    var plainText = "";

    while (nodeLine) {
        plainText += nodeLine.nodeValue + "\n";
        nodeLine = ni.nextNode();
    }

    return plainText;
 }

Comments

0

Use htmlClean.

Comments

0

I would use:

<script language="javascript" type="text/javascript" src="http://code.jquery.com/jquery-1.4.2.js"></script>
<script type="text/javascript">
    jQuery.fn.stripTags = function() { return this.replaceWith( this.html().replace(/<\/?[^>]+>/gi, '') ); };
    jQuery('head').stripTags();

    $(document).ready(function() {
        $("img").each(function() {
            jQuery(this).remove();
        });
    });
</script>

This will not release any styles, but will strip all tags out.

Is that what you wanted?

[EDIT] now edited to include removal of image tags[/EDIT]

1 Comment

Thou shalt not attempt to parse HTML with regular expressions.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.