
I want to fetch data from an HTML page (i.e. scrape it), but the reviews on the page are loaded by JavaScript. With a normal Java URL fetch I only get the raw HTML, without the JavaScript having been executed. I want the final page as it looks after the JavaScript has run.
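To illustrate, by a "normal Java URL fetch" I mean something like this minimal sketch, which only ever sees the raw markup the server sends:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RawFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw HTML only; no scripts are executed
            }
        }
    }
}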

Example: http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp

This page has comments in a Facebook plugin, which are fetched via JavaScript.

The same applies to this page: http://www.imdb.com/title/tt0848228/reviews

What should I do?

2 Comments

  • Your only real option for doing things like that in general is to harness a web browser as a component of your own software. Have the browser fetch the page and simulate whatever interactions are necessary for the JavaScript to do what it does, then examine the DOM (see the sketch after this list). Commented Jun 3, 2012 at 17:27
  • There should be a way to use the Facebook API to fetch the comments from that post as well, together with the rest of the page contents. Commented Jun 3, 2012 at 17:30
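A minimal sketch of the first comment's suggestion using Selenium WebDriver, one possible browser-harnessing library (it assumes the Selenium Java bindings and Firefox are installed):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class BrowserScrape {
    public static void main(String[] args) {
        // Drive a real browser so the page's JavaScript actually runs
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://www.imdb.com/title/tt0848228/reviews");
            // getPageSource() returns the DOM as it stands after scripts have run
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}

For content injected asynchronously (like the Facebook iframe) you would still need an explicit wait before reading the page source.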

3 Answers


Use PhantomJS: http://phantomjs.org

var page = require('webpage').create();
page.open("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp", function () {
    // Give the page's JavaScript (e.g. the Facebook plugin) time to run
    setTimeout(function () {
        // Where you want to save a screenshot of the rendered page
        page.render("screenshot.png");
        // You can access the content using jQuery (the page already loads it);
        // evaluate() can only return serializable values, hence the .html()
        var fbcomments = page.evaluate(function () {
            return $(".fb-comments iframe").contents().find(".postContainer").html();
        });
        console.log(fbcomments);
        phantom.exit();
    }, 10000);
});

You have to run PhantomJS with the option --web-security=no to allow cross-domain interaction (i.e. for the Facebook iframe), e.g. phantomjs --web-security=no yourscript.js

To communicate with other applications from PhantomJS, you can run a small web server or make a POST request: https://github.com/ariya/phantomjs/blob/master/examples/post.js


3 Comments

  • @Ivan I want to do this in Java, not JavaScript :P The scraping has to be done in Java.
  • It's a good thing you don't want to do it with a potato; man... that would be hard!
  • @IvanCastellanos While I agree this should work, I don't get the rendered HTML on some specific sites. In the example, the site renders elements like 'SITE_BACKGROUND' inside another element, but phantom never sees it. See this gist: gist.github.com/bizmate/db23887a7c5b066afafe2cc05acdd4ff. Any idea why it times out instead of returning the rendered HTML?

You can use HtmlUnit, a Java-based "GUI-less browser". You can easily get the final rendered output of any page, because it loads the page just as a web browser does and returns the final rendered output. (You can disable this behaviour, though.)

UPDATE: You were asking for an example? You don't have to do anything extra to get the rendered page:

Example:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myUrl);
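If the target page's JavaScript needs time to finish (the Facebook plugin loads asynchronously), the WebClient can be told to tolerate script errors and to wait for background JavaScript; these are standard WebClient/WebClientOptions methods. A sketch, with myUrl as in the example above:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient();
// Real-world pages often ship broken scripts; don't let them abort the fetch
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
HtmlPage myPage = webClient.getPage(myUrl);
// Give asynchronous scripts (e.g. the Facebook plugin) up to 10 seconds to finish
webClient.waitForBackgroundJavaScript(10000);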

UPDATE 2: You can get an iframe as follows:

HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();
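For example, to dump the frame's rendered markup (asXml() is HtmlUnit's DOM serializer):

System.out.println(myFrame.asXml());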

Please read the documentation at the link above. There is hardly anything you can't do with HtmlUnit when it comes to getting page content.

2 Comments

  • But it will be problematic for URLs if the page has broken (404) resources: if the page references a JS file that isn't present at its location, this API will throw exceptions.
  • Unfortunately the library you suggested is just super slow (~40 s to render a page that renders in 1 s in a normal browser!)

A simple way to solve that problem: you can use HtmlUnit, a Java API. I think it can help you access the executed JS content as plain HTML.

import java.net.URL;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(new URL("YourURL"));
System.out.println(myPage.getVisibleText());
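Once the page is loaded, you can also pull out individual nodes with getByXPath(); the selector below is hypothetical, so inspect the page you are scraping for the real class names:

import java.util.List;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;

// "review-text" is a made-up class name; adjust it to the target page
List<?> nodes = myPage.getByXPath("//div[@class='review-text']");
for (Object node : nodes) {
    System.out.println(((HtmlDivision) node).getTextContent());
}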

