Python 3, Web-scraping, and Javascript [Oh My]

Question

I have come to the point of entering the melee on web-scraping webpages using Javascript, with Python3. I am well aware that my boot may be making contact with a dead horse, but I feel like drawing my six-shooter anyway. It's a spaghetti western; be my gray hat?

::Backstory::

I am using Python 3.2.3.

I am interested in gathering historical stock//etf//mutual_fund price data for YTD, 1-yr, 3-yr, 5-yr, 10-yr... and/or similar timeframes for a user-defined stock, etf, or mutual fund. I set my sights on Morningstar.com, as they tend to provide as much data as possible without necessarily requiring a log-in; other folks such as finance.google.com &c tend to be inconsistent in what data they provide regarding stocks vs etfs vs mutual funds.

The trade-off in using Morningstar for this historical data, or "Trailing Total Returns" as they call it, is that for producing this data they use Javascript.

Here are some example links from Morningstar:

A Mutual Fund;

An ETF;

A Stock.

I am interested in the "Trailing Returns" portion, top row or so of numbers in the Javascript-produced chart.

::Attempted So Far::

I've confirmed that wget doesn't play with Javascript; even downloading all of the associated files [css, .js, &c] hasn't allowed me to locally render the javascript in browser or in script. Research here on StackOverflow confirmed this. Am willing to be corrected here.

My research informed me that Mechanize doesn't exist for Python3. I tried anyway, and turned into Policeman Javert crying out "I knew it!" at the error message "module does not exist".

::I've Heard Of...::

->Selenium. However, my understanding is that this requires Thy Favorite Browser to actually open up a webpage, navigate around, and then not close because there's no "close this tab//window" command//option for Selenium. What if I//my_user want to get historical data for many etfs, stocks, and/or mutual funds? That's a lot of tabs//windows opening up in a browser which was not necessarily desired to be opened.

->httplib2. I think this is nice, but I'm doubtful if it will play with Javascript. Does it, for example using the .cache and get options?

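import httplib2
conn = httplib2.Http(".cache")  # cache responses in the ./.cache directory
# request() returns a (response headers, body) pair; no u"" prefix, which Python 3.2 rejects
response, content = conn.request("http://the_url", "GET")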

->Windmill. See 'Selenium'. I am, however, off-key enough to sing 'Man of La Mancha'.

->Google's webscraping code. Would an attempt at downloading a Javascript-laden page result in ... positive results?

I've read chatter about having to "emulate a browser without a browser". Sounds like Mechanize, but not for Python3 as I currently understand.

::My Question::

Any suggestions, pointers, solutions, or "look over here" directions?

Many thanks,

Miles, Dusty Desert Villager.

By using Ajax requests without any kind of session-cookie check to at least filter out requests from other domains, they basically made your job really, really easy by doing it with JavaScript.
– Erik Reppen, Commented Aug 3, 2012 at 23:52

Cypress Frankenfeld · Accepted Answer · 2012-08-03 23:54:20Z

11

When a page loads data via javascript, it has to make requests to the server to get that data, typically through the browser's XMLHttpRequest (XHR) API. You can see what requests they are making, and then make them yourself, using wget!

To find out which requests they are making, use the Web Inspector (Chrome and Safari) or Firebug (Firefox). Here's how to do it in Chrome:

wrench/tools/developer tools/Network (tab at the top of the tools)/XHR filter at the bottom.

Here's an example request they make in javascript

If you look closely at the XHR request url, you notice that all trailing returns have the same format:

http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=

You just need to specify t. For example:

http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VAW
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=INTC
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VHCOX
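Since the question is about Python 3, here is a minimal sketch of building those URLs for any ticker using only the standard library. The helper name trailing_returns_url is made up for illustration; the base URL and the t parameter are the ones from the XHR request above.

```python
from urllib.parse import urlencode

# Base URL taken from the XHR request seen in the browser's Network tab.
BASE_URL = ("http://performance.morningstar.com/Performance/cef/"
            "trailing-total-returns.action")


def trailing_returns_url(ticker):
    """Return the trailing-returns URL for one ticker (the `t` parameter).

    Hypothetical helper: urlencode() takes care of escaping the value.
    """
    return BASE_URL + "?" + urlencode({"t": ticker})


if __name__ == "__main__":
    for ticker in ("VAW", "INTC", "VHCOX"):
        print(trailing_returns_url(ticker))
```

Going through urlencode rather than plain string concatenation keeps the URL valid if a user types a symbol containing a space or punctuation.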

Now you can wget those URIs and parse out the data directly.
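If you would rather stay inside Python 3 than shell out to wget, something like the following might work. It is only a sketch: it assumes the response is an HTML fragment containing a table (I have not checked the exact markup, and the endpoint may change), and the names TableRows, parse_rows, and fetch_rows are made up for illustration. Only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


def parse_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_rows(url):
    """Download `url` and parse its table rows (assumes an HTML response)."""
    with urlopen(url) as response:
        return parse_rows(response.read().decode("utf-8", errors="replace"))
```

Calling fetch_rows with one of the URLs above should then give you a list of rows to pick the trailing-return figures from, provided the response really is a table. If the server returns JSON instead, json.loads on the decoded body would replace parse_rows.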

edited Aug 3, 2012 at 23:54

answered Aug 3, 2012 at 23:26


5 Comments

Erik Reppen Over a year ago

I need to stop point-whoring. This works better as a comment to the original answer that got it right basically: "Chrome makes it pretty easy to see the XHR requests. Developer Tools (wrench or right-click/inspect-element)/Network (tab at the top of the tools)/XHR filter at the bottom."

Erik Reppen Over a year ago

By XHR he means http requests direct from JS. Or Ajax if you must. (I'd rather we didn't). All you have to scrape is tables and JSON. Not bad.

Cypress Frankenfeld Over a year ago

Thanks Erik! : ) I'll amend my answer to be more clear. It's not very easy to understand for someone who hasn't used dev tools in Chrome before.

MilesNielsen Over a year ago

Many thanks Cypress, this was buggering the heck out of me. I shall give this a spin XD

MilesNielsen Over a year ago

... and an awesome tip on the Firebug & similar; I was able to follow this methodology with same//similar results. Again, many thanks.
