
I need to fill form values on a target page and then click a button via Python. I've looked at Selenium and Windmill, but these are testing frameworks and I'm not testing. I'm trying to log into a third-party website programmatically, then download and parse a file we need to insert into our database. The problem with the testing frameworks is that they launch instances of browsers; I just want a script I can schedule to run daily to retrieve the page I want. Is there any way to do this?

5 Answers


You are looking for Mechanize.

Form submitting sample:

import re
from mechanize import Browser

br = Browser()
br.open("http://www.example.com/")
br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm (from ClientForm).
br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
response = br.submit()  # submit current form

5 Comments

I'm stuck using Python 2.6 though, so sadly Mechanize isn't an option either. (GopherError dropped in 2.6, looks like).
The Mechanize documentation is usually a bit terse, but it works really well!
I think you should persist and try debugging the gopher problem. In Python 2.6, gopher support was removed IIRC, so fixing your problem is probably a matter of commenting out the import gopherlib line and the few spots where gopher is actually used.
@Habaabiai: Mechanize advertises working on 2.6; you could ask a question about your problem with it. Also, you can try urllib2 (which will force you to write more code to submit a form).
It seems Mechanize does not support Python 3 (yet...?); I guess that means it is not maintained anymore (it is the first FAQ at wwwsearch.sourceforge.net/mechanize/faq.html).
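As the last comment suggests, a form can be submitted with the standard library alone, without Mechanize. Here is a minimal sketch written for Python 3 (where urllib2 became urllib.request); the URL, field names, and credentials are placeholders, not the real form:

```python
# Minimal form POST using only the standard library (Python 3).
# LOGIN_URL and the field names are placeholders for the real form.
from urllib.parse import urlencode
from urllib.request import Request

LOGIN_URL = "http://www.example.com/login"

# Encode the form fields exactly as a browser would for a POST body.
data = urlencode({"username": "xxx", "password": "pass"}).encode("ascii")

# Attaching a data payload makes urlopen issue a POST instead of a GET.
req = Request(LOGIN_URL, data=data,
              headers={"User-Agent": "Mozilla/5.0"})

print(req.get_method())  # POST, because a body is attached
```

Passing `req` to urllib.request.urlopen would then perform the actual login request.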

Have a look at this example, which uses Mechanize; it should give you the basic idea:

#!/usr/bin/python
import re
from mechanize import Browser

br = Browser()

# Ignore robots.txt
br.set_handle_robots(False)
# Google demands a user-agent that isn't a robot
br.addheaders = [('User-agent', 'Firefox')]

# Retrieve the Google home page, saving the response
br.open("http://google.com")

# Select the search box and search for 'foo'
br.select_form('f')
br.form['q'] = 'foo'

# Get the search results
br.submit()

# Find the link to foofighters.com; why did we run a search?
resp = None
for link in br.links():
    siteMatch = re.compile(r'www\.foofighters\.com').search(link.url)
    if siteMatch:
        resp = br.follow_link(link)
        break

# Print the site
content = resp.get_data()
print content
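If Mechanize is unavailable, the link-scanning loop above can be reproduced with the standard library's HTML parser. A small Python 3 sketch (the HTML snippet is made up for illustration):

```python
# Extract all href values from an HTML string using only the standard library.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<a href="http://www.foofighters.com/">Foo Fighters</a> <a href="/other">other</a>'
parser = LinkCollector()
parser.feed(html)
print(parser.links)  # ['http://www.foofighters.com/', '/other']
```

You would feed it the page body returned by whatever HTTP client you use, then match the collected URLs against your target pattern.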


You can use the standard urllib library to do this, like so:

import urllib

# urlencode the form fields; passing them as the data argument makes
# urlretrieve issue a POST instead of a GET.
data = urllib.urlencode({"username": "xxx", "password": "pass"})
urllib.urlretrieve("http://www.google.com/", "somefile.html", None, data)
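In Python 3 the same call is spelled slightly differently: urlretrieve moved to urllib.request, urlencode moved to urllib.parse, and the POST body must be bytes. A sketch with placeholder credentials, with the actual network call left commented out:

```python
from urllib.parse import urlencode

# Encode the form fields the same way the Python 2 one-liner does.
data = urlencode({"username": "xxx", "password": "pass"})
print(data)  # username=xxx&password=pass

# The Python 3 fetch would then be (note the body must be bytes here):
# from urllib.request import urlretrieve
# urlretrieve("http://www.example.com/login", "somefile.html", None, data.encode())
```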


The Mechanize example as suggested seems to work. In input fields where you must enter text, use something like:

br["kw"] = "rowling"  # (the method here is __setitem__)

If some content is generated after you submit the form, as in a search engine, you get it via:

print response.read()  # response is what br.submit() returned


For checkboxes, use 1 and 0 as true and false respectively:

br["checkboxname"] = 1   # checked = true
br["checkboxname2"] = 0  # checked = false
