6

I'm trying to scrape the UK Food Standards Agency ratings data from ASPX search results pages (e.g. http://ratings.food.gov.uk/QuickSearch.aspx?q=po30 ) using Mechanize/Python on ScraperWiki ( http://scraperwiki.com/scrapers/food_standards_agency/ ), but I'm running into a problem when trying to follow the "next" page links, which have the form:

<input type="submit" name="ctl00$ContentPlaceHolder1$uxResults$uxNext" value="Next >" id="ctl00_ContentPlaceHolder1_uxResults_uxNext" title="Next >" />

The form handler looks like:

<form method="post" action="QuickSearch.aspx?q=po30" onsubmit="javascript:return WebForm_OnSubmit();" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_ContentPlaceHolder1_buttonSearch')" id="aspnetForm">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />

An HTTP trace when I manually click the Next link shows __EVENTTARGET as empty. All the cribs I can find in other scrapers manipulate __EVENTTARGET as the way of handling Next pages.

Indeed, I'm not sure how the page I want to scrape ever loads the next page. Whatever I throw at the scraper, it only ever manages to load the first results page. (Even being able to change the number of results per page would be useful, but I can't see how to do that either!)

So - any ideas on how to scrape the 1+N'th results pages for N>0?

2 Answers

8

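Mechanize doesn't handle JavaScript, but for this particular case it isn't needed: the Next button is an ordinary submit input, so posting the form with that button pressed is enough. That is also why __EVENTTARGET stays empty in the HTTP trace.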
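To make that concrete, here is a rough sketch of the form fields a browser would post when the Next button is pressed. It only uses the three hidden inputs shown in the question; the helper name `next_page_payload` is made up for illustration, and the real page may well carry additional hidden fields that would have to be sent along unchanged.

```python
from urllib.parse import urlencode

def next_page_payload(hidden_fields, button_name, button_value):
    """Build the POST fields sent when a plain submit button is pressed."""
    payload = dict(hidden_fields)        # hidden inputs are sent back unchanged
    payload[button_name] = button_value  # the pressed button sends its own name/value
    return payload

# Hidden inputs as quoted in the question (a real page may contain more).
hidden = {'__EVENTTARGET': '', '__EVENTARGUMENT': '', '__LASTFOCUS': ''}
payload = next_page_payload(
    hidden, 'ctl00$ContentPlaceHolder1$uxResults$uxNext', 'Next >')
body = urlencode(payload)  # URL-encoded request body for the POST
```

Mechanize assembles this request for you when you submit the form with a named button, so nothing has to be built by hand.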

First we open the result page with mechanize:

import mechanize

url = 'http://ratings.food.gov.uk/QuickSearch.aspx?q=po30'
br = mechanize.Browser()
br.set_handle_robots(False)  # Do not fetch or obey robots.txt
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open(url)
response = br.response().read()  # HTML of the first results page

Then we select the aspnet form:

br.select_form(nr=0) #Select the first (and only) form - it has no name so we reference by number

The form has several submit buttons - we want to submit the one that takes us to the next result page:

response = br.submit(name='ctl00$ContentPlaceHolder1$uxResults$uxNext').read()  #"Press" the next submit button
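To walk through every results page, the same submit can be repeated until the Next button is no longer available. The following is only a sketch: `iter_result_pages`, `has_enabled_control` and the `max_pages` cap are illustrative names, and it assumes the last page either omits the Next button or marks it as disabled, which I have not checked against the live site.

```python
NEXT_BUTTON = 'ctl00$ContentPlaceHolder1$uxResults$uxNext'

def has_enabled_control(form, name):
    """Return True if the form has a control with this name that is not disabled."""
    return any(c.name == name and not getattr(c, 'disabled', False)
               for c in form.controls)

def iter_result_pages(br, next_button=NEXT_BUTTON, max_pages=100):
    """Yield the HTML of each results page, starting with the page br has open."""
    yield br.response().read()
    for _ in range(max_pages - 1):  # safety cap so a bad page cannot loop forever
        br.select_form(nr=0)        # re-select the form on the page just loaded
        if not has_enabled_control(br.form, next_button):
            return                  # assumed end-of-results signal
        yield br.submit(name=next_button).read()
```

With the `br` object opened as above, `for html in iter_result_pages(br): ...` would then hand each page's HTML to whatever parsing code you already have.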

The other submit buttons in the form are:

ctl00$uxLanguageSwitch # Switch language to Welsh
ctl00$ContentPlaceHolder1$uxResults$Button1 # Search submit button
ctl00$ContentPlaceHolder1$uxResults$uxFirst # First result page
ctl00$ContentPlaceHolder1$uxResults$uxPrevious # Previous result page
ctl00$ContentPlaceHolder1$uxResults$uxLast # Last result page

In mechanize we can get form info like this:

for form in br.forms():
    print(form)  # Dumps every control in the form, including the submit buttons
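If you only want the button names rather than the full form dump, the form's controls can be filtered by type. This is a small sketch; `submit_button_names` is a hypothetical helper, and it only picks up `<input type="submit">` controls, which is what this page uses.

```python
def submit_button_names(form):
    """Return the names of all <input type="submit"> controls in a mechanize form."""
    return [control.name for control in form.controls
            if control.type == 'submit']
```

After `br.select_form(nr=0)`, calling `submit_button_names(br.form)` should give the list of names that can be passed to `br.submit(name=...)`.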


2

Mechanize does not handle JavaScript.

There are several tools that do, however, including QtWebKit, python-spidermonkey, HtmlUnit (using Jython), and SeleniumRC.

Here is how it might be done with SeleniumRC:

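import selenium

# Connect to a Selenium RC server on localhost:4444 and drive Firefox
sel = selenium.selenium("localhost", 4444, "*firefox", "http://ratings.food.gov.uk")
sel.start()
sel.open("QuickSearch.aspx?q=po30")
# Click the Next button, identified by its name
sel.click('ctl00$ContentPlaceHolder1$uxResults$uxNext')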

See also these related SO questions:

  1. How to click a link that has JavaScript
  2. Click on a JavaScript link within Python
