How to get a webpage as plain text without any html using javascript? [duplicate]

Question

I am trying to find a way, using JavaScript or jQuery, to write a function that removes all the HTML tags from a page and gives me just the plain text of that page.

How can this be done? Any ideas?

Do you want a string that returns the text content of <body>, then?
– Matchu, Commented Jun 3, 2010 at 14:23

Jakub Hampl · Accepted Answer · 2010-06-03 14:42:21Z

9

IE & WebKit:

document.body.innerText

Others:

document.body.textContent

(as suggested by Amr ElGarhy)

Most JS frameworks implement a cross-browser way to do this, usually somewhat like this:

var text = document.body.textContent || document.body.innerText;

It seems that WebKit keeps some formatting with textContent, whereas it strips everything with innerText.
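
The fallback above can also be wrapped in a small function. The following is only a sketch: `getPlainText` is a hypothetical name, not part of any framework, and it accepts any object exposing `textContent` or `innerText`, so it can be tried outside a browser too.

```javascript
// Hypothetical helper: returns the plain text of an element-like object.
// Prefers textContent and falls back to innerText when textContent is absent.
function getPlainText(el) {
    if (!el) {
        return '';
    }
    if (typeof el.textContent === 'string') {
        return el.textContent;
    }
    if (typeof el.innerText === 'string') {
        return el.innerText;
    }
    return '';
}

// In a browser: var text = getPlainText(document.body);
```

Checking the type instead of using `||` means an element whose textContent is an empty string does not trigger the fallback.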

edited Jun 3, 2010 at 14:42

answered Jun 3, 2010 at 14:25

Jakub Hampl


5 Comments

Wolph Over a year ago

I think that only works in Internet Explorer.

Jakub Hampl Over a year ago

It works just fine in my WebKit.

Wolph Over a year ago

Indeed, only Firefox seems to have problems. But in Opera I still get HTML tags when printing innerText.

Amr Elgarhy Over a year ago

Use document.body.textContent in other browsers.

Amr Elgarhy Over a year ago

Your answer is complete and covers everything I wanted, thanks.

Wolph · Answer · 2010-06-03 14:25:27Z

3

It depends on how much formatting you want to keep. But with jQuery you can do it like this:

jQuery(document.body).text();
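
How much formatting survives can also be adjusted after the text has been extracted. As a minimal sketch, assuming you want every run of whitespace reduced to a single space (`collapseWhitespace` is a hypothetical helper, not a jQuery method):

```javascript
// Hypothetical helper: collapses every run of whitespace (spaces, tabs,
// newlines) into a single space and removes it from both ends.
function collapseWhitespace(text) {
    return String(text).replace(/\s+/g, ' ').replace(/^ | $/g, '');
}

// In a browser with jQuery loaded:
// var text = collapseWhitespace(jQuery(document.body).text());
```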

answered Jun 3, 2010 at 14:25

Wolph


kennebec · Answer · 2010-06-03 15:02:32Z

2

The only trouble with textContent or innerText is that they can jam the text from adjacent nodes together, without any white space between them.

If that matters, you can recurse through the body or another container, collect the text in an array, and then join the pieces with spaces or newlines.

// Recursively collect the non-blank text nodes under a node into an array.
document.deepText = function (node) {
    var parts = [], text;
    if (node) {
        node = node.firstChild;
        while (node != null) {
            if (node.nodeType == 3) {   // 3 = text node
                text = node.data || '';
                if (/\S/.test(text)) parts[parts.length] = text;
            }
            else parts = parts.concat(document.deepText(node));
            node = node.nextSibling;
        }
    }
    return parts;
};
alert(document.deepText(document.body).join(' '));
// Alternatively, join with newlines:
// document.deepText(document.body).join('\n')
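
A comment on this answer suggests also handling CDATA sections (nodeType 4). A possible variant is sketched below; `collectText` is a hypothetical name, and it is written as a standalone function that only relies on `firstChild`, `nextSibling`, `nodeType` and `data`, so it also works on any object tree with those properties.

```javascript
// Hypothetical variant of deepText: collects the text of text nodes
// (nodeType 3) and CDATA sections (nodeType 4) under a node, in document order.
function collectText(node) {
    var parts = [];
    var child = node ? node.firstChild : null;
    while (child) {
        if (child.nodeType === 3 || child.nodeType === 4) {
            var data = child.data || '';
            // Skip whitespace-only nodes, as deepText does.
            if (/\S/.test(data)) {
                parts.push(data);
            }
        } else {
            parts = parts.concat(collectText(child));
        }
        child = child.nextSibling;
    }
    return parts;
}

// In a browser: collectText(document.body).join(' ')
```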

answered Jun 3, 2010 at 15:02

kennebec


1 Comment

Jakub Hampl Over a year ago

It might be a good idea to add nodeType of 4 as well (CDATA) just in case someone wraps their text in it. (This is how jQuery does it at least.)

camster · Answer · 2012-08-03 01:47:14Z

1

I had to convert rich text in an HTML email to plain text. The following worked for me in IE (obj is a jQuery object):

function getTextFromHTML(obj) {
    var ni = document.createNodeIterator(obj[0], NodeFilter.SHOW_TEXT, null, false);
    var nodeLine = ni.nextNode();   // go to first node of our NodeIterator
    var plainText = "";

    while (nodeLine) {
        plainText += nodeLine.nodeValue + "\n";
        nodeLine = ni.nextNode();
    }

    return plainText;
}

answered Aug 3, 2012 at 1:47

camster


mcandre · Answer · 2010-06-03 14:24:35Z

0

Use htmlClean.

answered Jun 3, 2010 at 14:24

mcandre


Barrie Reader · Answer · 2010-06-03 14:31:39Z

0

I would use:

<script language="javascript" type="text/javascript" src="http://code.jquery.com/jquery-1.4.2.js"></script>
<script type="text/javascript">
    jQuery.fn.stripTags = function() { return this.replaceWith( this.html().replace(/<\/?[^>]+>/gi, '') ); };
    jQuery('head').stripTags();

    $(document).ready(function() {
        $("img").each(function() {
            jQuery(this).remove();
        });
    });
</script>

This will not release any styles, but will strip all tags out.

Is that what you wanted?

[EDIT] now edited to include removal of image tags[/EDIT]

answered Jun 3, 2010 at 14:31

Barrie Reader


1 Comment

Pointy Over a year ago

Thou shalt not attempt to parse HTML with regular expressions.
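
To illustrate the objection, here is a small sketch that applies the same pattern to plain strings. `stripTagsWithRegex` is a hypothetical wrapper used only for this demonstration, and the sample inputs are made up.

```javascript
// The tag-stripping pattern from the answer above, applied to a plain string.
function stripTagsWithRegex(html) {
    return html.replace(/<\/?[^>]+>/gi, '');
}

// A ">" inside an attribute value ends the match early, so part of the
// attribute leaks into the output: ' b">link'
var leaked = stripTagsWithRegex('<a title="a > b">link</a>');

// The contents of a script element are kept as if they were page text:
// 'Hivar x = 1;'
var scriptText = stripTagsWithRegex('<p>Hi</p><script>var x = 1;</script>');

// Character references are left undecoded: 'Fish &amp; chips'
var entity = stripTagsWithRegex('<b>Fish &amp; chips</b>');
```

Reading the text through the DOM (textContent, innerText or jQuery's .text()) leaves parsing to the browser and avoids the first and third problems.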
