Extract the text out of HTML string using JavaScript

Question

I am trying to get the inner text of an HTML string using a JS function (the string is passed as an argument). Here is the code:

function extractContent(value) {
  var content_holder = "";

  for (var i = 0; i < value.length; i++) {
    if (value.charAt(i) === '>') {
      continue;
      while (value.charAt(i) != '<') {
        content_holder += value.charAt(i);
      }
    }

  }
  console.log(content_holder);
}

extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");

The problem is that nothing gets printed on the console (content_holder stays empty). I think the problem is caused by the === operator.

Your while loop is never reached due to the continue instruction.
– Arnaud Christ, Commented Mar 6, 2015 at 13:13
Try tracing through your code with a "debugger"--did you do that?
– user663031, Commented Mar 6, 2015 at 13:22
Possible duplicate of JS: Extract text from a string without jQuery
– Rehan Haider, Commented May 11, 2018 at 7:42
Does this answer your question? Get the pure text without HTML element by javascript?
– KyleMit ♦, Commented Jan 11, 2020 at 20:08
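
To make the point about the unreachable while loop concrete, here is a minimal sketch of how the hand-written scan from the question could be repaired. It is not taken from any answer in this thread: the insideTag flag is an assumption introduced for illustration, and the function returns the text instead of logging it. It only tracks whether the current character sits between "<" and ">", so it shares the limits of any non-parser approach.

```javascript
// Hypothetical repair of the question's loop (illustration only).
function extractContent(value) {
  let contentHolder = "";
  let insideTag = false; // assumed helper flag: true while between "<" and ">"

  for (let i = 0; i < value.length; i++) {
    const ch = value.charAt(i);
    if (ch === "<") {
      insideTag = true; // a tag starts: stop collecting
    } else if (ch === ">") {
      insideTag = false; // the tag ended: resume collecting
    } else if (!insideTag) {
      contentHolder += ch; // plain text between tags
    }
  }
  return contentHolder;
}

console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>")); // HelloW3C
```

The original code failed because continue jumps straight to the next loop iteration, so nothing after it in the if block ever runs; the === comparison itself is fine.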

thedayturns · Accepted Answer · 2018-07-25 16:30:54Z

132

Create an element, store the HTML in it, and get its textContent:

function extractContent(s) {
  var span = document.createElement('span');
  span.innerHTML = s;
  return span.textContent || span.innerText;
};
    
alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));

Here's a version that allows you to have spaces between nodes, although you'd probably want that for block-level elements only:

function extractContent(s, space) {
  var span= document.createElement('span');
  span.innerHTML= s;
  if(space) {
    var children= span.querySelectorAll('*');
    for(var i = 0 ; i < children.length ; i++) {
      if(children[i].textContent)
        children[i].textContent+= ' ';
      else
        children[i].innerText+= ' ';
    }
  }
  return [span.textContent || span.innerText].toString().replace(/ +/g,' ');
};
    
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>.  Nice to <em>see</em><strong><em>you!</em></strong>"));

console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>.  Nice to <em>see</em><strong><em>you!</em></strong>",true));

edited Jul 25, 2018 at 16:30

thedayturns

answered Mar 6, 2015 at 13:16

Rick Hitchcock

8 Comments

davidkonrad Over a year ago

Outputs HelloW3C - really what OP wanted? Not Hello W3C?

Toshkuuu Over a year ago

No, white spaces are not required :) Sorry for not mentioning it!

Rick Hitchcock Over a year ago

Added a version that can add spaces between nodes.

Toni Over a year ago

WARNING: this is subject to XSS attacks. Only assign to innerHTML if you know and control the parameter's contents to a reasonable degree.

Rick Hitchcock Over a year ago

@Gangula, you should use DOMParser, which wasn't widely available when I posted this in 2015.

DollarAkshay · Accepted Answer · 2022-01-26 14:46:21Z

89

One line (more precisely, one statement) version:

function extractContent(html) {
    return new DOMParser()
        .parseFromString(html, "text/html")
        .documentElement.textContent;
}

edited Jan 26, 2022 at 14:46

DollarAkshay

answered Mar 6, 2015 at 13:58

user663031

5 Comments

Sharique Hussain Ansari Over a year ago

Nice answer, +1, but what is the difference between your answer and Rick Hitchcock's answer?

Rick Hitchcock Over a year ago

@shariqueansari, DOMParser is "experimental technology" but likely to be added to the spec. Its HTML support works in IE10+. My original answer worked in IE9+, but I've now updated it to support IE8.

Optimae Over a year ago

DOMParser now has wide support, see caniuse.com/#search=domparser

Flion Over a year ago

I hoped this would work on Node.js, but it doesn't. I ended up using npmjs.com/package/html2plaintext

Hamid Araghi Over a year ago

Can we use this method to extract content by id, as with document.getElementById?

Mubeen Khan · Accepted Answer · 2019-01-24 10:42:07Z

42

textContent is a very good technique for getting the desired result, but sometimes we don't want to load a DOM. A simple workaround is the following regular expression:

let htmlString = "<p>Hello</p><a href='http://w3c.org'>W3C</a>"
let plainText = htmlString.replace(/<[^>]+>/g, '');

answered Jan 24, 2019 at 10:42

Mubeen Khan

6 Comments

KP. Over a year ago

I know this is a very old comment, but could you please explain the meaning of the expression /<[^>]+>/g ? I'm having trouble understanding what each individual character means.

Kade Over a year ago

@Kelly The symbols you are referring to are a regular expression. It's kind of like a mini-programming language for parsing text. Here's a link to where you can learn more about each symbol: developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/…

Kade Over a year ago

It essentially says to find and remove each < that has stuff that is not a > between it and a >.
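
As a rough illustration of that explanation, the sketch below annotates each piece of the pattern and shows what it matches on the example string from the answer. The names TAG_PATTERN and stripTags are assumptions made up for this example.

```javascript
// Reading the pattern piece by piece:
//   <      a literal "<", the start of a tag
//   [^>]+  one or more characters that are not ">"
//   >      a literal ">", the end of the tag
//   g      global flag: act on every match, not only the first
const TAG_PATTERN = /<[^>]+>/g;

// Hypothetical helper: removes everything the pattern matches.
function stripTags(htmlString) {
  return htmlString.replace(TAG_PATTERN, "");
}

const sample = "<p>Hello</p><a href='http://w3c.org'>W3C</a>";
console.log(sample.match(TAG_PATTERN)); // the four tags that get removed
console.log(stripTags(sample)); // HelloW3C
```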

GD- Ganesh Deshmukh Over a year ago

Most helpful. Regex is one of the best tools/mini-languages for coders.

hanism Over a year ago

Different techniques suit different cases, and this is the right approach for mine: Telegram bot development, which has no innerHTML or other things that web development relies on.

Rana Ahmer Yasin · Accepted Answer · 2015-03-06 13:11:11Z

8

Use this regex to remove the HTML tags and keep only the inner text.

For the example string it yields only HelloW3C; check it:

var content_holder = value.replace(/<(?:.|\n)*?>/gm, '');

answered Mar 6, 2015 at 13:11

Rana Ahmer Yasin

3 Comments

Rana Ahmer Yasin Over a year ago

Could you please give me a reason?

user663031 Over a year ago

stackoverflow.com/questions/1732348/…

user663031 Over a year ago

If you are going to use regexp, then a simpler version would be /<[\s\S]*?>/, or /<[^]*?>/. Your m flag accomplishes nothing; it relates to the behavior of ^ and $.
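
A small sketch of that point, using a made-up input with a newline inside a tag (the variable names are assumptions for illustration). It compares the answer's pattern with and without the m flag, the simpler [\s\S] form, and a plain dot.

```javascript
// Made-up input: the opening tag contains a line break.
const html = "<p\nclass='x'>Hello</p>";

const original = /<(?:.|\n)*?>/gm; // pattern from the answer
const withoutM = /<(?:.|\n)*?>/g; // same pattern, m flag dropped
const simpler = /<[\s\S]*?>/g; // [\s\S] matches any character, newlines included
const plainDot = /<.*?>/g; // "." does not match a newline

console.log(html.replace(original, "")); // Hello
console.log(html.replace(withoutM, "")); // Hello, so the m flag made no difference
console.log(html.replace(simpler, "")); // Hello
console.log(JSON.stringify(html.replace(plainDot, ""))); // the first tag survives
```

The m flag only changes where ^ and $ can match, and the pattern uses neither, which is why dropping it leaves the result unchanged.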

Abraham · Accepted Answer · 2022-09-11 23:47:07Z

3

For Node.js

This uses the jsdom library, since Node.js doesn't have the DOM features that a browser provides.

import * as jsdom from "jsdom";

const html = "<h1>Testing</h1>";
// A Document's own textContent is null, so read the text from the body element
const text = new jsdom.JSDOM(html).window.document.body.textContent;

console.log(text);

answered Sep 11, 2022 at 23:47

Abraham

Sharique Hussain Ansari · Accepted Answer · 2015-03-06 13:47:44Z

2

Try this:

<!DOCTYPE html>
<html>
<body>
<script type="text/javascript">
function extractContent(value){
        var div = document.createElement('div')
        div.innerHTML=value;
        var text= div.textContent;            
        return text;
}
window.onload=function()
{
   alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
};
</script>
</body>
</html>

edited Mar 6, 2015 at 13:47

answered Mar 6, 2015 at 13:14

Sharique Hussain Ansari

2 Comments

user663031 Over a year ago

Did you test this? It fails to extract "W3C" as it should.

user663031 Over a year ago

Please try your solution with the string Hello, <p>Buggy<i>World</i></p>.

Adam MacDonald · Accepted Answer · 2015-03-06 13:06:49Z

0

You could temporarily write it out to a block-level element that is positioned off the page, something like this:

HTML:

<div id="tmp" style="position:absolute;top:-400px;left:-400px;">
</div>

JavaScript:

<script type="text/javascript">
function extractContent(value){
        var div=document.getElementById('tmp');
        div.innerHTML=value;
        console.log(div.children[0].innerHTML);//console out p
}

extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");
</script>

answered Mar 6, 2015 at 13:06

Adam MacDonald

2 Comments

user663031 Over a year ago

Right approach, but you don't need an element in the DOM to do this. Just create an element with var div = document.createElement('div') and proceed from there.

user663031 Over a year ago

Also, this will fail with nested HTML elements, such as <p>Hello<i>Bob</i></p><a>...</a>. It will retain the markup inside the p element.

Bardelman · Accepted Answer · 2023-03-24 15:15:44Z

Based on Rick Hitchcock's answer and KevBot's, this is the best way I found to do it:

function getTextLoop(element: HTMLElement | ChildNode) {
  const texts = [];
  Array.from(element.childNodes).forEach((node) => {
    if (node.nodeType === 3) {
      texts.push(node.textContent.trim());
    } else {
      texts.push(...getTextLoop(node));
    }
  });
  return texts;
}

function innerText(element: HTMLElement) {
  return getTextLoop(element).join(" ");
}

export function extractContent(s, space) {
  var span = document.createElement("span");
  span.innerHTML = s;
  if (space) {
    // Assign as text, not HTML, so the extracted text is not parsed as markup again
    span.textContent = innerText(span);
  }
  return [span.textContent || span.innerText].toString().replace(/ +/g, " ");
}

Example:

extractContent("<div>foo<div>bar</div></div>", true); // foo bar

Joy · Accepted Answer · 2022-03-10 16:07:56Z

-1

With jQuery, you can pass a comma-separated list of tag selectors:

var readableText = [];
$("p, h1, h2, h3, h4, h5, h6").each(function(){ 
     readableText.push( $(this).text().trim() );
})
console.log( readableText.join(' ') );

answered Mar 10, 2022 at 16:07

Joy

Deepak Singh · Accepted Answer · 2023-01-13 06:00:39Z

-1

Use the match() function to pull out the HTML tags:

const text = `<div>Hello World</div>`;
console.log(text.match(/<[^>]*?>/g));

answered Jan 13, 2023 at 6:00

Deepak Singh

1 Comment

Quentin Mar 17 at 10:51

That does the opposite of what is being asked for (and even if it didn't, regex are terrible tools for processing HTML, consider what would happen given the input <div>If maths, 2 < 3</div>).
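
To see what that input does to a tag-stripping regex, here is a short sketch. It pairs the input from the comment with the /<[^>]+>/g pattern used in another answer on this page; that pairing is an assumption made for illustration.

```javascript
// Input taken from the comment above; the pattern comes from another answer.
const input = "<div>If maths, 2 < 3</div>";

// The bare "<" in "2 < 3" is treated as the start of a tag, so the match
// runs on to the next ">" and swallows " 3</div" along with it.
const stripped = input.replace(/<[^>]+>/g, "");

console.log(JSON.stringify(stripped)); // "If maths, 2 "
```

The "3" is silently lost, which is the kind of failure a real HTML parser avoids.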

Dane · Accepted Answer · 2015-03-06 13:05:49Z

-3

You need an array to hold the values:

function extractContent(value) {
  var content_holder = new Array();

  for (var i = 0; i < value.length; i++) {
    if (value.charAt(i) === '>') {
      continue;
      while (value.charAt(i) != '<') {
        content_holder.push(value.charAt(i));
        console.log(content_holder[i]);
      }
    }
  }
}

extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");

answered Mar 6, 2015 at 13:05

Dane

3 Comments

NewToJS Over a year ago

You don't need an array, if you want the results to be a string but having an array will allow the user to access each result/value.

user663031 Over a year ago

You haven't fixed the basic logic error in the OP code. Did you test this?

NewToJS Over a year ago

I'm going to guess "not"
