I want to use vanilla JS to loop through a string of HTML and get the text of certain elements. With jQuery I can do something like this:

var str1 = "<div><h2>This is a heading1</h2><h2>This is a heading2</h2></div>";
$.each($(str1).find('h2'), function(index, value) {
    // log the text of each h2
    console.log($(value).text());
});

As I understand it, $(str1) parses the string into DOM elements wrapped in a jQuery object, and .text() then returns the text of an element (here, each h2). But I want to do this on the backend in my Node app rather than on the client side, because it would be more efficient (?), and it would also be nice not to rely on jQuery.

Some context: I'm working on a blogging app, and I want to build a table of contents as an object on the server side.

  • What exactly is your question? What are you trying to achieve? Commented Jan 2, 2018 at 23:41
  • Why would you be having DOM nodes on the server where there is no DOM? Commented Jan 2, 2018 at 23:46
  • Well, jQuery should work on the backend, but if you don't want to rely on it, you would probably have to use a set of regular expressions to find each element tag or, more easily, run the string through a document parser. You can check the npm site for such parsers. Commented Jan 2, 2018 at 23:46
  • @ScottMarcus probably web scraping for example Commented Jan 2, 2018 at 23:48
  • Cheerio can do this, but afaik it doesn't allow some things such as class manipulation. github.com/cheeriojs/cheerio Commented Jan 2, 2018 at 23:49

2 Answers

This is another way using .innerHTML, which also makes use of the built-in iterable protocol.

Here are the operations we'll need, their types, and the built-in feature that provides each one:

  • Create an HTML element from a text
    String -> HTMLElement – provided by set Element#innerHTML

  • Get the text contents of an HTML element
    HTMLElement -> String – provided by get Node#textContent

  • Find nodes matching a query selector
    (HTMLElement, String) -> NodeList – provided by Element#querySelectorAll

  • Transform a list of nodes to a list of text
    (NodeList, HTMLElement -> String) -> [String] – provided by Array.from

// html2elem :: String -> HTMLElement
const html2elem = html =>
  {
    const elem = document.createElement ('div')
    elem.innerHTML = html
    return elem.childNodes[0]
  }

// findText :: (String, String) -> [String]
const findText = (html, selector) =>
  Array.from (html2elem(html).querySelectorAll(selector), e => e.textContent)

// str :: String  
const str =
  "<div><h1>MAIN HEADING</h1><h2>This is a heading1</h2><h2>This is a heading2</h2></div>";

console.log (findText (str, 'h2'))
// [
//   "This is a heading1",
//   "This is a heading2"
// ]
// :: [String]

console.log (findText (str, 'h1'))
// [
//   "MAIN HEADING"
// ]
// :: [String]


The best way to parse HTML is to use the DOM. If all you have is a string of HTML, then (according to another Stack Overflow member) you can create a "dummy" DOM element and assign the string to it, which lets you work with the markup through the DOM, as follows:

var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>aTitle</title></head>" +
    "<body><div><h2>This is a heading1</h2><h2>This is a heading2</h2></div>" +
    "</body></html>";


Now you have a couple of ways to access the data using the DOM, as follows:

var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>aTitle</title></head><body><div><h2>This is a heading1</h2><h2>This is a heading2</h2></div></body></html>";

// one way: getElementsByTagName returns a live HTMLCollection
var h2s = el.getElementsByTagName("h2");
for (var i = 0, max = h2s.length; i < max; i++) {
    console.log(h2s[i].textContent);
}
console.log("\n"); // blank line between the two outputs

// and another: querySelectorAll returns a static NodeList
var elementList = el.querySelectorAll("h2");
for (i = 0, max = elementList.length; i < max; i++) {
    console.log(elementList[i].textContent);
}

You may also use a regular expression, as follows:

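var str = '<div><h2>This is a heading1</h2><h2>This is a heading2</h2></div>';

// capture the text between <h2> and </h2>
var re = /<h2>([^<]*?)<\/h2>/g;
var match;
var m = [];
// exec() returns null once there are no more matches
while ( (match = re.exec(str)) !== null ) {
    m.push(match.pop()); // the last element is the captured text
}
console.log(m);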

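The regex consists of an opening h2 tag, followed by any run of characters other than "<", followed by a closing h2 tag. The "*" allows zero or more such characters, and the trailing "?" makes the quantifier lazy, so it matches as few characters as possible. The parentheses capture the text between the tags.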
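A small sketch of what this pattern does and does not match, which may help when deciding whether a regex is enough. The helper name collectH2 and the sample strings are made up for illustration; only the pattern comes from the answer.

```javascript
// Same pattern as in the answer above.
const re = /<h2>([^<]*?)<\/h2>/g;

// collectH2 is a hypothetical helper name, not part of the original answer.
const collectH2 = html => {
  const out = [];
  let match;
  re.lastIndex = 0; // a global regex remembers its position between calls
  while ((match = re.exec(html)) !== null) {
    out.push(match[1]); // match[1] is the captured heading text
  }
  return out;
};

// Illustrative inputs (assumptions, not from the question).
const plain = '<div><h2>This is a heading1</h2><h2></h2></div>';
const nested = '<div><h2>Intro to <em>regex</em></h2><h2 class="x">Styled</h2></div>';

console.log(collectH2(plain));  // [ 'This is a heading1', '' ]
console.log(collectH2(nested)); // []
```

The second call returns nothing because the pattern cannot cross a nested tag and does not allow attributes on the opening h2. If the blog's headings can contain either, a real HTML parser is probably the safer route.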

Per Ryan of Stack Overflow:

exec with a global regular expression is meant to be used in a loop, as it will still retrieve all matched subexpressions.

The critical part of the regex is the "g" flag, as per MDN. It allows the exec() method to obtain multiple matches in a given string, because each call resumes where the previous match ended. In each loop iteration, match becomes an array containing two elements: the full match and the captured text. pop() removes the last of these, the captured text, which is then pushed onto m, so the array m ultimately contains all the captured text values.
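Since the question mentions building a table of contents on the server, here is one way the capture-group idea might be extended to record heading levels as well. This is only a sketch: buildToc and the { level, text } shape are assumptions, it uses String.prototype.matchAll rather than an exec() loop, and it assumes headings carry no attributes or nested tags. It needs no DOM, so it should run in Node as-is.

```javascript
// buildToc is a hypothetical helper, not from the original answers.
// Assumes headings have no attributes and no nested markup.
const buildToc = html => {
  // ([1-6]) captures the heading level; \1 requires the closing tag to match it.
  const re = /<h([1-6])>([^<]*)<\/h\1>/g;
  // matchAll yields one match array per heading: [fullMatch, level, text]
  return Array.from(html.matchAll(re), ([, level, text]) => ({
    level: Number(level),
    text: text.trim(),
  }));
};

const post =
  '<div><h1>MAIN HEADING</h1><h2>This is a heading1</h2><h2>This is a heading2</h2></div>';

console.log(buildToc(post));
// [
//   { level: 1, text: 'MAIN HEADING' },
//   { level: 2, text: 'This is a heading1' },
//   { level: 2, text: 'This is a heading2' }
// ]
```

Keeping the level as a number leaves room to nest the entries later, for example by grouping each h2 under the h1 that precedes it.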
