
I'm scraping some HTML pages with Rails, using Nokogiri.

I ran into problems when I tried to scrape an AngularJS page, because the gem reads the HTML before the page has been fully rendered.

Is there a way to scrape this type of page? How can I make sure the page is fully rendered before scraping it?

  • You may need to use something like PhantomJS to fully render pages with JavaScript on them. Commented Nov 19, 2014 at 21:09
  • The problem is that the page loads its content dynamically. Turn off JavaScript in your browser and hit the page, and you'll see what your code is seeing, since neither your code nor the browser at that point interprets and runs JavaScript. Nokogiri has no way of "opening the HTML"; it only parses what it is given. Commented Nov 19, 2014 at 23:23

2 Answers


If you're trying to scrape AngularJS pages in a fully generic fashion, you will likely need something like what @tadman mentioned in the comments (PhantomJS): a headless browser that executes the AngularJS JavaScript and then exposes the resulting DOM for inspection.

If you have a specific site or sites that you want to scrape, the path of least resistance is likely to avoid the AngularJS frontend entirely and directly query the API from which the Angular code pulls its content. Most AngularJS sites download the static JS and HTML templates and then make AJAX calls back to a server (either their own or a third-party API) to get the content that will be rendered. If you look at their code, you can likely query directly whatever Angular is calling (e.g. via $http, ngResource, or Restangular). The returned data is typically JSON, which is much easier to work with than scraping the post-rendered HTML.
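As a rough illustration of querying the backing API directly, here is a minimal sketch using only Ruby's standard library. The endpoint URL, the `items` key, and the `title` field are all assumptions made up for the example, as are the helper names `fetch_json` and `extract_titles`; the real endpoint and payload shape have to be read from the site's own Angular code or from the browser's network tab.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Fetches a JSON endpoint and returns the parsed body.
# `headers` lets you pass along anything the API expects (e.g. an auth token).
def fetch_json(url, headers = {})
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['Accept'] = 'application/json'
  headers.each { |name, value| request[name] = value }

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end

  raise "HTTP #{response.code} for #{url}" unless response.is_a?(Net::HTTPSuccess)
  JSON.parse(response.body)
end

# Pulls the (hypothetical) "title" field out of each entry under "items".
def extract_titles(payload)
  payload.fetch('items', []).map { |item| item['title'] }
end

# Usage (hypothetical endpoint, not run here):
#   payload = fetch_json('https://example.com/api/items')
#   puts extract_titles(payload)
```

Keeping the HTTP call and the field extraction in separate methods means the extraction can be adjusted and checked against a saved JSON response without hitting the site again.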


1 Comment

Nice answer, @Mike. However, most sites don't want to be scraped and build in protections against that type of behavior, and it's not as simple as recreating a single call when you need multiple authentication steps to get a response.

You can use:

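require 'nokogiri'
require 'phantomjs'
require 'watir'

# Drive a headless PhantomJS browser so the page's JavaScript actually runs
b = Watir::Browser.new(:phantomjs)
b.goto URL # URL is a placeholder for the address of the page to scrape

# Hand the browser's current HTML to Nokogiri for parsing
doc = Nokogiri::HTML(b.html)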

Download PhantomJS from http://phantomjs.org/download.html and move the binary to /usr/bin.
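Since Angular may still be filling in the page right after `goto` returns, it can help to poll until the markup looks rendered before reading `b.html`. The following is only a sketch: `wait_for` and `rendered?` are hypothetical helpers, and treating "no `{{ ... }}` placeholders left" as "rendered" is an assumption that will not hold for every page (for example, pages that use `ng-bind` never show the braces at all).

```ruby
# Hypothetical helper: calls the block repeatedly until it returns a truthy
# value, and raises if that does not happen within `timeout` seconds.
def wait_for(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout}s"
    end
    sleep interval
  end
end

# Heuristic (an assumption, not an AngularJS guarantee): the page counts as
# rendered once no "{{ ... }}" template placeholders remain in the HTML.
def rendered?(html)
  !html.match?(/\{\{.*?\}\}/m)
end

# Usage with the Watir browser `b` from the snippet above (not run here):
#   html = wait_for(timeout: 15) { h = b.html; h if rendered?(h) }
#   doc  = Nokogiri::HTML(html)
```

If you know a specific element that only appears after rendering, checking for that element in the block is usually a more reliable condition than the placeholder heuristic.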

