
I'm pretty new to Node.js.
I'm trying to download some HTML from a website so that I can parse it and print some information for debugging.
This works with the http module (see this post), but when I print each chunk this way:

var http = require("http");

var req = http.request(options, function(res) {
    res.setEncoding("utf8");
    res.on("data", function (chunk) {
        console.log(chunk);
    });
});
req.end(); // the request is only sent once end() is called

I don't get the HTML that is loaded dynamically (with Ajax, for instance):

<div class="container">
  ::before
      <div class="row">
        ::before
....
</div>

Is there another module that can help me with this?

Thanks!

update

I would like to share with you my success (thanks to @oKonyk).

  • npm install phantomjs
  • create your script
  • use the same code suggested by @oKonyk

Note that if you're running your script locally, you need to set these options:

options = { 'web-security': 'no' };
phantom.create({parameters: options}, function() {});

1 Answer


In order to capture dynamically built pages, you have to render them in a browser. There are several options for doing that with Node.js.

I would suggest using phantomjs, which is a so-called headless browser.

As a proof of concept, you can install it globally with npm install phantomjs -g. Create a test script 'google.js' with the following content:

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.google.org', function(status) {
  if (status !== 'success') {
    console.log('Unable to access network');
  } else {
    var html = page.evaluate(function() {
      return document.getElementsByTagName('html')[0].innerHTML;
    });
    console.log(html);
  }
  phantom.exit();
});

Then run it as phantomjs google.js

This prints the whole DOM of the page (at least everything within the <html> tags), which is different from the raw response that you get with the http module.
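For comparison, here is a minimal sketch of collecting the complete raw response with the built-in http module, so it can be set side by side with the PhantomJS output. The helper name fetchRawHtml is hypothetical, and the sketch assumes a plain http:// URL. Whatever the page's scripts would add later in a browser will be missing from this string.

```javascript
var http = require('http');

// Hypothetical helper: resolves with the complete raw body of an HTTP response.
function fetchRawHtml(url) {
  return new Promise(function (resolve, reject) {
    http.get(url, function (res) {
      var body = '';
      res.setEncoding('utf8');
      // 'data' can fire many times; each chunk is only one piece of the page.
      res.on('data', function (chunk) { body += chunk; });
      // 'end' fires once the whole response has arrived.
      res.on('end', function () { resolve(body); });
    }).on('error', reject);
  });
}

// Usage (the URL is only an example):
// fetchRawHtml('http://www.google.org').then(function (html) { console.log(html); });
```

The result is the markup exactly as the server sent it: no scripts are run, so elements that JavaScript inserts after load never appear in it.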

Later you can use phantom within your node project (more info here).
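One way to do this without any wrapper library is to start the phantomjs binary as a child process and read what it prints. This is only a sketch: runScript is a hypothetical helper, and it assumes phantomjs is on your PATH and that google.js (the script above) is in the current directory.

```javascript
var execFile = require('child_process').execFile;

// Hypothetical helper: runs a command and resolves with everything it wrote to stdout.
function runScript(command, args) {
  return new Promise(function (resolve, reject) {
    // Rendered pages can be large, so raise the default stdout buffer limit.
    execFile(command, args, { maxBuffer: 10 * 1024 * 1024 }, function (err, stdout) {
      if (err) {
        reject(err);
      } else {
        resolve(stdout);
      }
    });
  });
}

// Usage (assumes the phantomjs binary is on PATH and google.js exists):
// runScript('phantomjs', ['google.js']).then(function (html) {
//   console.log('Rendered HTML length: ' + html.length);
// });
```

Because google.js prints the user agent line before the HTML, you would need to remove that console.log (or strip the first line) before parsing the output.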


3 Comments

Thanks for answer @oKonyk. With this solution I get "phantom stdout: NETWORK_ERR: XMLHttpRequest Exception 101: A network error occurred in synchronous requests." I'm running script from localhost, so I suppose that I have to setup "--web-security=false", isn't it? Where can I set this option? I'm running my script with node.
I've created phantom.create({'web-security':'no'}, function (ph) {}); but errors still appear!
So the URL that you are trying to access is localhost? If it's just any public URL, I could try to access it from my system... When you are saying that you run your script with node, do you mean you have your phantomjs process as child process on your main node script?
