0

Say I have many options in an HTML page (opened as a text file), as below:

<select id="my">
  <option id="1">1a</option>
  <option id="2">2bb</option>     
</select>

<select id="my1">
  <option id="11">11a</option>
  <option id="21">21bb</option>     
</select>

Now, I've searched for <select id= like this:

with open('/u/poolla/Downloads/creat/xyz.txt') as f:
    for line in f:
        line = line.strip()
        if '<select id=' in line:
            print("true")

Now, whenever <select id= occurs, I want to get the id value, that is, copy the string from the " after id= until the next " occurs.

How do I do this in Python?

5
  • 7
    Please! BeautifulSoup: stackoverflow.com/questions/1732348/… Commented Apr 9, 2014 at 12:34
  • 1
    Or lxml, if you want a less awful parser. :P Commented Apr 9, 2014 at 12:36
  • re.findall('id=".*?"', line)[0][4:-1] yw... Commented Apr 9, 2014 at 12:54
  • 1
    @Wooble: You do know that BeautifulSoup uses pluggable parsers and that lxml, if installed, is the default, right? BeautifulSoup 4 is not about parsing (anymore) but about the object model. Which is pretty neat for most HTML tasks, really. Commented Apr 9, 2014 at 12:57
  • @Wooble: Use lxml if you want to use the ElementTree-on-steroids object model instead. Don't pick it because you think the parser might be better... Commented Apr 9, 2014 at 12:58
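
For completeness, the line-by-line regex idea from the comments can be sketched as below. This is only a minimal illustration, assuming each <select> tag sits on a single line and its id is double-quoted, as in the sample from the question; the names SELECT_ID and select_ids are made up for this sketch. For anything beyond such simple input, an HTML parser is the safer choice.

```python
import re

# Hypothetical helper names; not part of any library.
# Matches an opening <select ...> tag and captures the double-quoted id value.
# Assumes the whole tag is on one line and the id uses double quotes.
SELECT_ID = re.compile(r'<select\b[^>]*?\sid="([^"]*)"')


def select_ids(lines):
    """Return the id of every <select> tag found in an iterable of lines."""
    ids = []
    for line in lines:
        ids.extend(SELECT_ID.findall(line))
    return ids


sample = [
    '<select id="my">',
    '  <option id="1">1a</option>',
    '</select>',
    '<select id="my1">',
]
print(select_ids(sample))
```

Because a file object iterates over its lines, the same function could be called as select_ids(f) inside the with open(...) block from the question.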

2 Answers 2

3

An HTML parser library is usually better at parsing HTML than raw string functions or regular expressions. Here's an example with the standard HTMLParser class:

html = """
<select id="my">
  <option id="1">1a</option>
  <option id="2">2bb</option>
</select>

<select id="my1">
  <option id="11">11a</option>
  <option id="21">21bb</option>
</select>
"""

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == 'select':
            self.ids.extend(val for name, val in attrs if name == 'id')


p = MyParser()
p.feed(html)
print p.ids  # ['my', 'my1']
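
The code above targets Python 2. On Python 3 the same class lives in the html.parser module and print is a function, so a roughly equivalent sketch might look like this. The names SelectIdParser and select_ids_from_text are made up for illustration, and the sample string is a shortened version of the HTML from the question.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class SelectIdParser(HTMLParser):
    """Collects the id attribute of every <select> start tag (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "select":
            self.ids.extend(val for name, val in attrs if name == "id")


def select_ids_from_text(text):
    """Parse an HTML string and return the ids of its <select> tags."""
    parser = SelectIdParser()
    parser.feed(text)
    parser.close()
    return parser.ids


html_text = (
    '<select id="my"><option id="1">1a</option></select>\n'
    '<select id="my1"><option id="11">11a</option></select>'
)
print(select_ids_from_text(html_text))
```

To run it against the file from the question, the text could be read with f.read() inside a with open(...) block and passed to select_ids_from_text.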

0

BeautifulSoup4 has a very useful select method, which makes it possible to query an HTML document with CSS selectors.

Something like the following code (not tested, sorry :-) ) should get the ids of all the select tags in an HTML document.

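from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
tags = soup.select("select")  # CSS selector: every <select> element
print([t.get("id") for t in tags])  # .get returns None when a tag has no id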
