
I want to fetch data from an HTML page (i.e. scrape it), but the reviews on the page are loaded by JavaScript. With a normal Java URL fetch I only get the raw HTML, without the JavaScript having been executed. I want the final page as it looks after the JavaScript has run.
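To illustrate, by a "normal Java URL fetch" I mean something like this minimal sketch, which only ever sees the raw markup the server sends:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RawFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw HTML only; no scripts are executed
            }
        }
    }
}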

Example: http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp

This page has comments in a Facebook plugin, which are fetched via JavaScript.

The same applies to this page: http://www.imdb.com/title/tt0848228/reviews

What should I do?

2 Comments

  • Your only real option for doing things like that in general is to harness a web browser as a component of your own software. Have the browser fetch the page and simulate whatever interactions are necessary for the JavaScript to do what it does, then examine the DOM (see the sketch after this list). Commented Jun 3, 2012 at 17:27
  • There should be a way to use the Facebook API to fetch the comments from that post as well, together with the rest of the page contents. Commented Jun 3, 2012 at 17:30
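A minimal sketch of the first comment's suggestion using Selenium WebDriver, one possible browser-harnessing library (it assumes the Selenium Java bindings and Firefox are installed):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class BrowserScrape {
    public static void main(String[] args) {
        // Drive a real browser so the page's JavaScript actually runs
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://www.imdb.com/title/tt0848228/reviews");
            // getPageSource() returns the DOM as it stands after scripts have run
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}

For content injected asynchronously (like the Facebook iframe) you would still need an explicit wait before reading the page source.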

3 Answers


Use PhantomJS: http://phantomjs.org

var page = require('webpage').create();
page.open("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp", function () {
    // Give the page's JavaScript (e.g. the Facebook plugin) time to run
    setTimeout(function () {
        // Where you want to save a screenshot of the rendered page
        page.render("screenshot.png");
        // You can access the content using jQuery (the page already loads it);
        // evaluate() can only return serializable values, hence the .html()
        var fbcomments = page.evaluate(function () {
            return $(".fb-comments iframe").contents().find(".postContainer").html();
        });
        console.log(fbcomments);
        phantom.exit();
    }, 10000);
});

You have to run PhantomJS with the option --web-security=no to allow cross-domain interaction (i.e. for the Facebook iframe), e.g. phantomjs --web-security=no yourscript.js

To communicate with other applications from PhantomJS, you can run a small web server or make a POST request: https://github.com/ariya/phantomjs/blob/master/examples/post.js


3 Comments

  • @Ivan I want to do this in Java, not JavaScript :P The scraping has to be done in Java.
  • It's a good thing you don't want to do it with a potato; man... that would be hard!
  • @IvanCastellanos While I agree this should work, I don't get the rendered HTML on some specific sites. In the example, the site renders elements like 'SITE_BACKGROUND' inside another element, but phantom never sees it. See this gist: gist.github.com/bizmate/db23887a7c5b066afafe2cc05acdd4ff. Any idea why it times out instead of returning the rendered HTML?

You can use HtmlUnit, a Java-based "GUI-less browser". You can easily get the final rendered output of any page, because it loads the page just as a web browser does and returns the final rendered output. (You can disable this behaviour, though.)

UPDATE: You were asking for an example? You don't have to do anything extra to get the rendered page:

Example:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myUrl);
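If the target page's JavaScript needs time to finish (the Facebook plugin loads asynchronously), the WebClient can be told to tolerate script errors and to wait for background JavaScript; these are standard WebClient/WebClientOptions methods. A sketch, with myUrl as in the example above:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient();
// Real-world pages often ship broken scripts; don't let them abort the fetch
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
HtmlPage myPage = webClient.getPage(myUrl);
// Give asynchronous scripts (e.g. the Facebook plugin) up to 10 seconds to finish
webClient.waitForBackgroundJavaScript(10000);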

UPDATE 2: You can get an iframe as follows:

HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();
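For example, to dump the frame's rendered markup (asXml() is HtmlUnit's DOM serializer):

System.out.println(myFrame.asXml());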

Please read the documentation at the link above. There is hardly anything you can't do with HtmlUnit when it comes to getting page content.

2 Comments

  • But it will be problematic for URLs if the page has broken (404) resources: if the page references a JS file that isn't present at its location, this API will throw exceptions.
  • Unfortunately the library you suggested is just super slow (~40 s to render a page that renders in 1 s in a normal browser!)

A simple way to solve that problem: you can use HtmlUnit, a Java API. I think it can help you access the executed JS content as plain HTML.

import java.net.URL;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(new URL("YourURL"));
System.out.println(myPage.getVisibleText());
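Once the page is loaded, you can also pull out individual nodes with getByXPath(); the selector below is hypothetical, so inspect the page you are scraping for the real class names:

import java.util.List;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;

// "review-text" is a made-up class name; adjust it to the target page
List<?> nodes = myPage.getByXPath("//div[@class='review-text']");
for (Object node : nodes) {
    System.out.println(((HtmlDivision) node).getTextContent());
}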

