75

I am trying to get the inner text of an HTML string using a JS function (the string is passed as an argument). Here is the code:

function extractContent(value) {
  var content_holder = "";

  for (var i = 0; i < value.length; i++) {
    if (value.charAt(i) === '>') {
      continue;
      while (value.charAt(i) != '<') {
        content_holder += value.charAt(i);
      }
    }

  }
  console.log(content_holder);
}

extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");

The problem is that nothing gets printed to the console (*content_holder* stays empty). I think the problem is caused by the === operator.

5

11 Answers

132

Create an element, store the HTML in it, and get its textContent:

function extractContent(s) {
  var span = document.createElement('span');
  span.innerHTML = s;
  return span.textContent || span.innerText;
};
    
alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));


Here's a version that allows you to have spaces between nodes, although you'd probably want that for block-level elements only:

function extractContent(s, space) {
  var span= document.createElement('span');
  span.innerHTML= s;
  if(space) {
    var children= span.querySelectorAll('*');
    for(var i = 0 ; i < children.length ; i++) {
      if(children[i].textContent)
        children[i].textContent+= ' ';
      else
        children[i].innerText+= ' ';
    }
  }
  return [span.textContent || span.innerText].toString().replace(/ +/g,' ');
};
    
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>.  Nice to <em>see</em><strong><em>you!</em></strong>"));

console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>.  Nice to <em>see</em><strong><em>you!</em></strong>",true));

8 Comments

Outputs HelloW3C - really what OP wanted? Not Hello W3C?
No, white spaces are not required :) Sorry for not mentioning it!
Added a version that can add spaces between nodes.
WARNING: this is subject to XSS attacks. Only assign to innerHTML if you know and control the parameter's contents to a reasonable degree.
@Gangula, you should use DOMParser, which wasn't widely available when I posted this in 2015.
89

One line (more precisely, one statement) version:

function extractContent(html) {
    return new DOMParser()
        .parseFromString(html, "text/html")
        .documentElement.textContent;
}

5 Comments

Nice answer, +1, but what is the difference between your answer and Rick Hitchcock's answer?
@shariqueansari, DOMParser is "experimental technology" but likely to be added to the spec. Its HTML support works in IE10+. My original answer worked in IE9+, but I've now updated it to support IE8.
DOMParser now has wide support, see caniuse.com/#search=domparser
I hoped this would work in Node.js, but it doesn't. I ended up using npmjs.com/package/html2plaintext
Can we use this method to extract content by id, as with document.getElementById?
42

textContent is a very good technique for achieving the desired result, but sometimes we don't want to load the DOM. A simple workaround is the following regular expression:

let htmlString = "<p>Hello</p><a href='http://w3c.org'>W3C</a>"
let plainText = htmlString.replace(/<[^>]+>/g, '');
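
One thing the regex does not do, unlike textContent, is decode character references such as &amp;amp;. A rough sketch, assuming only a handful of common entities matter (stripTags and the entity table are illustrative names, not part of the answer above):

```javascript
// Hypothetical helper: strip tags with the same regex, then decode a few common entities.
function stripTags(html) {
  // Assumption: only these entities occur; &nbsp; is mapped to a plain space for readability.
  var entities = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&#39;": "'",
    "&nbsp;": " ",
    "&amp;": "&"
  };
  return html
    .replace(/<[^>]+>/g, "")
    // A single pass over the text avoids decoding "&amp;lt;" twice.
    .replace(/&(?:lt|gt|quot|#39|nbsp|amp);/g, function (match) {
      return entities[match];
    });
}

console.log(stripTags("<p>Tom &amp; Jerry</p>")); // Tom & Jerry
```

Anything beyond this short list (numeric references, the full named-entity table) would need a real parser or a dedicated library.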

6 Comments

I know this is a very old comment, but could you please explain the meaning of the expression /<[^>]+>/g ? I'm having trouble understanding what each individual character means.
@Kelly The symbols you are referring to are a regular expression. It's kind of like a mini-programming language for parsing text. Here's a link to where you can learn more about each symbol: developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/…
It essentially says to find and remove each < that has stuff that is not a > between it and a >.
Most helpful. Regex is one of the best tools/mini-languages for coders.
Different techniques for different cases. This is the right approach for my case, Telegram bot development, which has no innerHTML or anything else that web development relies on.
8

Use this regex to remove the HTML tags and keep only the inner text.

For the example string it yields only HelloW3C:

var content_holder = value.replace(/<(?:.|\n)*?>/gm, '');
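
As a quick check, assuming a made-up input whose first tag contains a line break, the pattern still strips it, and dropping the m flag changes nothing because m only affects ^ and $ (the variable names below are just for this sketch):

```javascript
// Hypothetical input: the <p> tag spans two lines.
var value = "<p\n class='x'>Hello</p><a href='http://w3c.org'>W3C</a>";

var withM = value.replace(/<(?:.|\n)*?>/gm, '');    // pattern as posted
var withoutM = value.replace(/<(?:.|\n)*?>/g, '');  // same pattern, no m flag
var charClass = value.replace(/<[\s\S]*?>/g, '');   // equivalent character-class form

console.log(withM); // HelloW3C
console.log(withM === withoutM && withoutM === charClass); // true
```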

3 Comments

please give me a reason please?
If you are going to use regexp, then a simpler version would be /<[\s\S]*?>/, or /<[^]*?>/. Your m flag accomplishes nothing; it relates to the behavior of ^ and $.
3

For Node.js

This uses the jsdom library, since Node.js doesn't have the DOM features that browsers provide.

import * as jsdom from "jsdom";

const html = "<h1>Testing</h1>";
// textContent is null on the Document node itself, so read it from <body>.
const text = new jsdom.JSDOM(html).window.document.body.textContent;

console.log(text);

Comments

2

Try this:

<!DOCTYPE html>
<html>
<body>
<script type="text/javascript">
function extractContent(value){
        var div = document.createElement('div')
        div.innerHTML=value;
        var text= div.textContent;            
        return text;
}
window.onload=function()
{
   alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
};
</script>
</body>
</html>

2 Comments

Did you test this? It fails to extract "W3C" as it should.
Please try your solution with the string Hello, <p>Buggy<i>World</i></p>.
0

You could temporarily write it out to a block-level element that is positioned off the page, something like this:

HTML:

<div id="tmp" style="position:absolute;top:-400px;left:-400px;">
</div>

JavaScript:

<script type="text/javascript">
function extractContent(value){
        var div=document.getElementById('tmp');
        div.innerHTML=value;
        console.log(div.children[0].innerHTML);//console out p
}

extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");
</script>

2 Comments

Right approach, but you don't need an element in the DOM to do this. Just create an element with var div = document.createElement('div') and proceed from there.
Also, this will fail with nested HTML elements, such as <p>Hello<i>Bob</i></p><a>...</a>. It will retain the markup inside the p element.
0

Based on Rick Hitchcock's answer and KevBot's, this is the best way I found to do it:

function getTextLoop(element: HTMLElement | ChildNode) {
  const texts = [];
  Array.from(element.childNodes).forEach((node) => {
    if (node.nodeType === 3) {
      texts.push(node.textContent.trim());
    } else {
      texts.push(...getTextLoop(node));
    }
  });
  return texts;
}

function innerText(element: HTMLElement) {
  return getTextLoop(element).join(" ");
}

export function extractContent(s, space) {
  var span = document.createElement("span");
  span.innerHTML = s;
  if (space) {
    // Assign as text, not HTML, so decoded characters like "<" are not re-parsed as markup.
    span.textContent = innerText(span);
  }
  return [span.textContent || span.innerText].toString().replace(/ +/g, " ");
}

Example:

extractContent("<div>foo<div>bar</div></div>", true); // foo bar

Comments

-1

Using jQuery, we can select several tags at once with a comma-separated selector:

var readableText = [];
$("p, h1, h2, h3, h4, h5, h6").each(function(){ 
     readableText.push( $(this).text().trim() );
})
console.log( readableText.join(' ') );

Comments

-1

Use the match() function to pull out the HTML tags:

const text = `<div>Hello World</div>`;
console.log(text.match(/<[^>]*?>/g));
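
Note that match() with this pattern returns the tags themselves. If the goal is the text between them, one possible sketch is to split on the same kind of pattern and drop the empty pieces (textBetweenTags is a name made up for this example):

```javascript
// Hypothetical helper: returns the text pieces found between tags.
function textBetweenTags(html) {
  return html
    .split(/<[^>]*>/)   // cut the string at anything that looks like a tag
    .filter(Boolean);   // drop the empty strings left at the edges
}

// Logs an array containing the single string "Hello World".
console.log(textBetweenTags(`<div>Hello World</div>`));
```

This inherits the limits of the regex: a stray < in the text, as in 2 < 3, is still mistaken for the start of a tag and the text after it is lost.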

1 Comment

That does the opposite of what is being asked for (and even if it didn't, regex are terrible tools for processing HTML, consider what would happen given the input <div>If maths, 2 < 3</div>).
-3

You need an array to hold the values:

function extractContent(value) {
  var content_holder = new Array();

  for (var i = 0; i < value.length; i++) {
    if (value.charAt(i) === '>') {
      continue;
      while (value.charAt(i) != '<') {
        content_holder.push(value.charAt(i));
        console.log(content_holder[i]);
      }
    }
  }
}

extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");
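
The continue statement in the loop above jumps straight to the next iteration, so the inner while is never reached. Even if it were reached, it would never terminate, because i is not advanced inside it. The === operator is not the cause. As a minimal sketch (not a robust HTML parser; it naively treats every < as the start of a tag), the scan could instead track whether it is currently inside a tag:

```javascript
// Sketch of a corrected character scan. Assumption: every '<' opens a tag
// and every '>' closes one, which does not hold for arbitrary HTML.
function extractContent(value) {
  var content_holder = "";
  var insideTag = false;

  for (var i = 0; i < value.length; i++) {
    var ch = value.charAt(i);
    if (ch === '<') {
      insideTag = true;       // a tag starts: stop collecting
    } else if (ch === '>') {
      insideTag = false;      // the tag ended: resume collecting
    } else if (!insideTag) {
      content_holder += ch;   // plain text outside any tag
    }
  }
  return content_holder;
}

console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>")); // HelloW3C
```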

3 Comments

You don't need an array, if you want the results to be a string but having an array will allow the user to access each result/value.
You haven't fixed the basic logic error in the OP code. Did you test this?
I'm going to guess "not"
