6

I want to extract the key-value pairs (attributes) of some form elements in an HTML page.

For example, I want:

name="frmLogin" method="POST" onSubmit="javascript:return validateAndSubmit();" action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home"

while the original line is

<form name="frmLogin" method="POST" onSubmit="javascript:return validateAndSubmit();" action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home">

Is there a method I can use to safely get the key and value pairs? I tried splitting by spaces and then on '=' characters, but a string inside quotes can also contain '='.

Is there a different kind of split method that also takes care of quotes?

1
  • If you search for the values of the input elements of a form (as opposed to the attributes), then this answer fits: stackoverflow.com/a/65603777/633961 Commented Jan 7, 2021 at 7:19

6 Answers

7

Use a parsing library such as lxml.html for parsing html.

The library will have a simple way for you to get what you need, probably not taking more than a few steps:

  1. load the page using the parser

  2. choose the form element to operate on

  3. ask for the data you want

Example code:

>>> import lxml.html
>>> doc = lxml.html.parse('http://stackoverflow.com/questions/13432626/split-a-string-in-python-taking-care-of-quotes')
>>> form = doc.xpath('//form')[0]
>>> form
<Element form at 0xbb1870>
>>> form.attrib
{'action': '/search', 'autocomplete': 'off', 'id': 'search', 'method': 'get'}
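
If lxml is not installed, the same three steps (load, pick the form, ask for its attributes) can be sketched with the standard library's html.parser. This is only an illustration, not part of the answer above: the class name FormAttrCollector and the variable names are made up for the example, and the input is the form line from the question. Note that html.parser lowercases attribute names, so onSubmit comes back as onsubmit.

```python
from html.parser import HTMLParser


class FormAttrCollector(HTMLParser):
    """Hypothetical helper: collect the attributes of every <form> start tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed;
        # html.parser lowercases tag names and attribute names.
        if tag == "form":
            self.forms.append(dict(attrs))


# The form line from the question, split across string literals for readability.
line = ('<form name="frmLogin" method="POST" '
        'onSubmit="javascript:return validateAndSubmit();" '
        'action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_'
        'rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home">')

collector = FormAttrCollector()
collector.feed(line)

form_attrs = collector.forms[0]
print(form_attrs["onsubmit"])  # javascript:return validateAndSubmit();
```

Because the parser understands quoting, the space and the '=' inside the attribute values need no special handling.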

2

You could use a regular expression like this one:

([^=\s]+)="([^"]*)"

In Python, you can do this:

#!/usr/bin/python

import re

text = 'name="frmLogin" method="POST" onSubmit="javascript:return validateAndSubmit();" action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home"'

# findall returns one (key, value) tuple per attribute;
# [^"]* lets a quoted value contain spaces and '=' characters.
pairs = re.findall(r'([^=\s]+)="([^"]*)"', text)

print(pairs)

2 Comments

Escaped double quotes could pose a problem here.
@JanDvorak I've never seen any quotes escape in HTML, but indeed, it could be a problem...
1
>>> s = r'name="frmLogin" method="POST" onSubmit="javascript:return validateAndSubmit();" action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home"'
>>> lst = s.split('" ')
>>> for item in lst:
...     print(item.split('="'))
...
['name', 'frmLogin']
['method', 'POST']
['onSubmit', 'javascript:return validateAndSubmit();']
['action', 'TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home"']


0
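{k: v for k, v in (i.split('="', 1) for i in s.rstrip('"').split('" '))}

where s is your original string. The rstrip('"') drops the closing quote of the last value, and the limit of 1 keeps any later '="' inside a value intact.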
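
On the question of a split that takes care of quotes: the standard library's shlex.split does exactly that, splitting on whitespace while keeping quoted substrings together and removing the quotes. The sketch below assumes the attribute text follows shell-like quoting (double quotes, no backslashes inside values); shlex is not an HTML parser, so treat this as a quick option for simple strings. The variable names are illustrative.

```python
import shlex

# The attribute string from the question, split across literals for readability.
s = ('name="frmLogin" method="POST" '
     'onSubmit="javascript:return validateAndSubmit();" '
     'action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_'
     'rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home"')

# shlex.split breaks on whitespace outside quotes and strips the quotes,
# so each token looks like key=value.
tokens = shlex.split(s)

# Split each token on the first '=' only, so '=' inside the value survives.
pairs = dict(token.split('=', 1) for token in tokens)

print(pairs['onSubmit'])  # javascript:return validateAndSubmit();
```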


0
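attrs = eval('dict(%s)' % name.replace(' ', ','))
print(attrs)
{'name': 'frmLogin', 'method': 'POST', 'onSubmit': 'javascript:return,validateAndSubmit();', 'action': 'TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home'}

This works for the sample string, where name holds the attribute text. Note that replacing spaces with commas also changes values that contain spaces (see onSubmit above), and eval should only be run on trusted input.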


0

You can use a library which has support for parsing HTML forms.

For example: https://mechanize.readthedocs.io/en/latest/

Stateful programmatic web browsing in Python. Browse pages programmatically with easy HTML form filling and clicking of links.
