
I have a number of web pages, fetched with curl, that I am attempting to parse information from. Each of the pages uses jQuery to transform its content once the document has loaded in the browser (via a document.ready handler), mostly setting the classes/ids of divs. The information is much easier to parse once those JavaScript functions have run.

What are my options (preferably from the command line) for executing the JavaScript on these pages and dumping the transformed HTML?

  • getfirebug.com/commandline? Is this what you are looking for? Commented May 20, 2012 at 8:41
  • +1 sounds interesting :) I thought about node.js for a while, but that won't work for you =/ Commented May 20, 2012 at 8:44

1 Answer


To scrape dynamic web pages, don't use static download tools like curl.

Instead, use a headless web browser that you can control from your programming language. The most popular tool for this is Selenium:

http://code.google.com/p/selenium/

With Selenium you can export the modified DOM tree out of the browser as HTML.

An example use case:

https://stackoverflow.com/a/10053589/315168
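
For reference, here is a minimal sketch of that approach using Selenium's Java bindings with ChromeDriver, mirroring what the comment below describes (load each page, then dump the page source). The URL is a placeholder and it assumes a chromedriver binary is available on your PATH; treat it as an illustration rather than the exact code used.

    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    public class DumpRenderedHtml {
        public static void main(String[] args) {
            ChromeOptions options = new ChromeOptions();
            options.addArguments("--headless=new");       // run Chrome without a visible window
            WebDriver driver = new ChromeDriver(options); // assumes chromedriver is on your PATH

            try {
                // Placeholder URL; replace with the page you want to scrape
                driver.get("https://example.com/page-with-jquery");

                // getPageSource() returns the DOM as it stands after the page's
                // scripts (including document.ready handlers) have executed,
                // i.e. the transformed HTML rather than the raw server response
                String renderedHtml = driver.getPageSource();
                System.out.println(renderedHtml);
            } finally {
                driver.quit();
            }
        }
    }

Note that driver.get() blocks until the page's load event fires, so document.ready handlers will normally have executed by the time the source is read; content fetched asynchronously after load may still require an explicit wait.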


1 Comment

Thanks Mikko, I ended up using Selenium with the Java & Chrome bindings to load each page and subsequently dump the page source - it worked a treat!
