
I am writing code to extract all the plain text content from the <body> tag of an HTML document. I know this can be done using the document element, but I need to do it using regex only. I have written the following code, but it has some bugs that I cannot figure out how to fix.

function htmlToText(html) {
      return html.
        replace(/(.|\n)*<body.*>/, ''). //remove up till body
        replace(/<\/body(.|\n)*/, ''). //remove from </body
        replace(/<.+\>/, ''). //remove tags
        replace(/^\s\n*$/gm, '');  //remove empty lines
    }

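Here is a solution. The tag-removal step needs the g flag so that every tag is replaced, and it should match [^>]+ rather than .+ so that each match stops at the first closing bracket.

function htmlToText(html) {
  return html
    .replace(/[\s\S]*<body[^>]*>/i, '') // remove everything up to and including the opening <body> tag
    .replace(/<\/body[\s\S]*/i, '')     // remove </body and everything after it
    .replace(/<[^>]+>/g, '')            // remove every remaining tag, one tag per match
    .replace(/^\s*\n/gm, '');           // remove whitespace-only lines
}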
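
To see why the tag-removal pattern matters, here is a small sketch comparing the greedy `<.+>` with the negated class `<[^>]+>`. The sample string and the variable names are made up for illustration; they are not from the original post.

```javascript
// Hypothetical sample input (an assumption, not from the original post).
const line = '<p>Some <strong>strong</strong> text</p>';

// Greedy: `.+` runs from the first '<' to the LAST '>' on the line,
// so the text between the tags is deleted along with them.
const greedy = line.replace(/<.+>/g, '');

// Negated class: each match stops at the first '>', so only tags are removed.
const perTag = line.replace(/<[^>]+>/g, '');

console.log(JSON.stringify(greedy)); // ""
console.log(JSON.stringify(perTag)); // "Some strong text"
```

With the greedy pattern, any line that has more than one tag loses its text, which is most likely the bug seen in the question.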
  • In the general case, you cannot parse HTML accurately with a regular expression. You'd be better off letting something (the browser itself, if that's where your code runs) parse the HTML for you, and then you can traverse the DOM looking for text nodes. Commented Sep 20, 2018 at 13:18
  • Just use document.getElementsByTagName("body")[0].innerText Commented Sep 20, 2018 at 13:19
  • I am not running this on a client. I am parsing the HTML code as a normal string Commented Sep 24, 2018 at 4:08
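
As a rough illustration of the first comment, here are two inputs where a tag-stripping regex gives the wrong text even though it handles simple markup fine. The helper name `stripTags` and both sample strings are assumptions made up for this sketch.

```javascript
// Hypothetical helper: strips anything that looks like a tag.
const stripTags = (s) => s.replace(/<[^>]+>/g, '');

// Case 1: a '>' inside a quoted attribute value ends the match too early,
// so part of the attribute leaks into the output.
const withAttribute = stripTags('<a title="a > b">link</a>');

// Case 2: the <script> tags are removed but the script source stays,
// so code ends up mixed into the "plain text".
const withScript = stripTags('<script>var x = 1;</script>Hello');

console.log(JSON.stringify(withAttribute)); // " b\">link"
console.log(JSON.stringify(withScript));    // "var x = 1;Hello"
```

If the input is known to be simple and under your control, the regex approach may be good enough; for arbitrary HTML strings, cases like these are why a real parser is usually recommended.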

1 Answer


No need to overthink it; you can just use document.body.innerText.

A Sample Document
Some strong and emphasized text



2 Comments

I would be glad if you could help me out with a regex solution.
@DipeshDesai That would be an unwise implementation. Work smarter, not harder.
