Extracting <BODY> text of HTML document in node js using REGEX [duplicate]

Question

I am writing code to extract all the plain-text content from the <body> tag of an HTML document. I know it can be done using the document element, but I need to do this using regex only. I have written the following code, but it has some bugs that I have not been able to figure out how to fix.

function htmlToText(html) {
  return html
    .replace(/(.|\n)*<body.*>/, '')  // remove up till body
    .replace(/<\/body(.|\n)*/, '')   // remove from </body
    .replace(/<.+\>/, '')            // remove tags
    .replace(/^\s\n*$/gm, '');       // remove empty lines
}
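One likely source of the trouble is that `String.prototype.replace` with a regex that lacks the `g` flag replaces only the first match, so only the first tag is removed. The sketch below illustrates this; `sample` is a made-up input, and the pattern `<[^>]+>` is used instead of the question's `<.+\>` purely to isolate the effect of the flag.

```javascript
// Hypothetical input, not taken from the original question.
const sample = '<p>one</p><p>two</p>';

// Without the g flag, only the first match is replaced.
const firstOnly = sample.replace(/<[^>]+>/, '');

// With the g flag, every match is replaced.
const allTags = sample.replace(/<[^>]+>/g, '');

console.log(firstOnly); // one</p><p>two</p>
console.log(allTags);   // onetwo
```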

Here is the solution I arrived at:

function htmlToText(html) {
  return html
    .replace(/[\s\S]*<body[^>]*>/i, '')  // remove everything up to and including the opening <body> tag
    .replace(/<\/body[\s\S]*/i, '')      // remove from </body to the end
    .replace(/<[^>]+>/g, '')             // remove every tag, one tag per match
    .replace(/^\s*\n/gm, '');            // remove empty lines
}
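A note on the tag-removal step: a greedy pattern such as `<.+>` matches from the first `<` to the last `>` on a line, so it can swallow the text between tags, whereas `<[^>]+>` stops at the first `>` after each `<`. The following is a minimal sketch of the difference; the input string `line` is an assumption made for illustration, and this is not a general-purpose HTML parser.

```javascript
// Hypothetical single-line input used only for illustration.
const line = '<p>Some <strong>strong</strong> text</p>';

// Greedy: matches the whole line, from the first "<" to the last ">".
const greedy = line.replace(/<.+>/g, '');

// Per-tag: each match ends at the first ">" after its "<".
const perTag = line.replace(/<[^>]+>/g, '');

console.log(JSON.stringify(greedy)); // ""
console.log(JSON.stringify(perTag)); // "Some strong text"
```

Even the per-tag pattern can misfire on input such as a `>` inside an attribute value or inside a comment, which is part of why regex-only extraction is fragile.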

In the general case, you cannot parse HTML accurately with a regular expression. You'd be better off letting something (the browser itself, if that's where your code runs) parse the HTML for you, and then you can traverse the DOM looking for text nodes.
– Pointy, Commented Sep 20, 2018 at 13:18
Just use document.getElementsByTagName("body")[0].innerText
– Arun Kumar, Commented Sep 20, 2018 at 13:19
I am not running this on a client. I am parsing the HTML code as a normal string.
– Dipesh Desai, Commented Sep 24, 2018 at 4:08

scniro · Accepted Answer · 2018-09-20 13:21:18Z

No need to overthink it; you can just use document.body.innerText.

A Sample Document
Some strong and emphasized text

JSFiddle example


2 Comments

Dipesh Desai Over a year ago

Would be glad if you could help me out with a REGEX solution

scniro Over a year ago

@DipeshDesai That would be an unwise implementation. Work smarter, not harder.
