Python, parse html form

Question

how I can get input from html forms on other sites? I want it to return a dictionary such as:

form = [('name' = 'somename', 'type' = 'text', 'value':''},{' name' = 'somename', 'type' = 'submit', 'value': ' submit ').

Sorry for my English.

Are you trying to parse a HTML file (possibly returned from urllib.urlopen-ing a url), or is this some Django based thing? — Stephen
– Stephen, Commented Aug 22, 2010 at 10:27

Tim McNamara · Accepted Answer · 2010-08-22 10:35:09Z

3

you probably wont be able to retrieve form data from other users on other sites. If you wish to use a script to send data to a form, mechanize is one tool that makes this quite easy.

answered Aug 22, 2010 at 10:35

Tim McNamara

18.5k5 gold badges54 silver badges84 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Artyom Over a year ago

Thanks for the answer, but forms unfortunately not static and each time different, therefore is necessary the full analysis. In mechanize it will be not absolutely convenient.

Ivo van der Wijk Over a year ago

In that case, use lxml.html to parse the document, find form and input tags (possibly using xpath queries), and so on.

Tim McNamara Over a year ago

Derek, surely the forms are being generated using the <form> tag. This should be all that you need to get started. If the forms are indeterministic, no script will be able to assist you. If you mean the forms are generated by client-side JavaScript, then browser automation may help.

Yuda Prawira · Accepted Answer · 2010-09-21 18:11:29Z

2

Yeah mechanize is sweet !

import mechanize

# Browser
br = mechanize.Browser()
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# we inspect the all form element in the http://stackoverflow.com
br.open('http://stackoverflow.com')
for form in br.forms():
    print form

answered Sep 21, 2010 at 18:11

Yuda Prawira

12.5k10 gold badges49 silver badges55 bronze badges

Comments

Ivo van der Wijk · Accepted Answer · 2010-08-22 10:35:05Z

1

Look at mechanize, lxml.html and BeatifulSoup.

answered Aug 22, 2010 at 10:35

Ivo van der Wijk

16.8k4 gold badges45 silver badges57 bronze badges

2 Comments

OTZ Over a year ago

BeautifulSoup is discontinued. Better not to mention.

Tim McNamara Over a year ago

BeautifulSoup is also much slower than lxml.html

Collectives™ on Stack Overflow

Python, parse html form

3 Answers 3

3 Comments

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

3 Comments

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related