Extract strings in Python

Question

Basically, I want to extract the strings "AAA", "BBB", "CCC", "DDD" from a text file...

...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
....(more text).....
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
......(more text).....

I want something like this. If I do:

data = foo("file.txt")

I get:

data = ['AAA','BBB','CCC','DDD']

What is the best possible way? My file is not big...

Basically, I want to extract "remaining upload data transfer" from this file which in HTML looks like THIS

Oli · Accepted Answer · 2010-03-17 17:48:54Z

2

You could write a regex, but it would be "parsing" the HTML to some extent. The problem with writing regular expressions for HTML is that HTML is a mess. It's rarely perfect, and this causes problems when you rely on it for data.

I would personally use BeautifulSoup. It does do more than you're asking, but at a fraction of the effort.
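To make the fragility concrete, here is a small sketch using only the standard library. The pattern is a hypothetical one written for the exact markup in the question, and the sample deliberately varies the quoting and the tag case on two lines (both variations are assumptions for illustration, not taken from the asker's file).

```python
import re

# Hypothetical pattern that matches the question's markup character for character.
PATTERN = re.compile(r"<font class='textfont'>(.*?)</font>")

# Lines 2 and 3 are equivalent HTML, but use double quotes and an uppercase tag.
sample = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<TD align="left" class=texttd><font class="textfont">BBB</font></TD>
<TD align="left" class=texttd><FONT class='textfont'>CCC</FONT></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
"""

matches = PATTERN.findall(sample)
print(matches)  # ['AAA', 'DDD']
```

The pattern silently misses BBB and CCC. It can be loosened to cover these two cases, but each new variation in the markup needs another adjustment, which is the kind of work an HTML parser already does.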


Dominic Rodger · Accepted Answer · 2010-03-17 17:40:08Z

0

You want BeautifulSoup:

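from bs4 import BeautifulSoup  # BeautifulSoup 4; older releases used "from BeautifulSoup import BeautifulSoup"

with open("file.txt") as your_file:
    soup = BeautifulSoup(your_file, "html.parser")

# find() returns only the first match; find_all() returns every <font class="textfont">
data = [tag.get_text() for tag in soup.find_all("font", "textfont")]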


3 Comments

shadyabhi Over a year ago

I want to do it without using a third-party library, because I don't really want HTML processing. My aim is just to extract those strings.

Mike Graham Over a year ago

@shadyabhi, Not using a library is a silly goal. An HTML parser is the right tool for what you are trying to do (which is parsing HTML) and provides a way to write a simple, concise function.

Mike Graham Over a year ago

@Dominic, lxml is probably a better choice these days, as it is still actively developed.

inspectorG4dget · Accepted Answer · 2010-03-17 17:50:55Z

0

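def foo(path="myfile.txt"):
    # Read the whole file into a single string; "with" closes the file afterwards
    with open(path, 'r') as input_file:
        text = input_file.read()

    # The strings must be known in advance; this only checks which are present
    looking_for = ['AAA', 'BBB', 'CCC', 'DDD']
    have = []

    for thing in looking_for:
        if thing in text:
            have.append(thing)
    return have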


2 Comments

fortran Over a year ago

I think that won't preserve the ordering if more than one item is present in the same line...

inspectorG4dget Over a year ago

I don't know what you mean by "ordering". I see no such specification in the question. And my algorithm will find all the strings in looking_for that are in the html, even if they are in the same line.
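One possible reading of the ordering concern is that a membership test reports matches in the order of looking_for, not in the order they appear in the file. The sketch below illustrates that difference; found_in_document_order is a hypothetical helper name, and the sample text is made up for the illustration.

```python
def found_in_document_order(text, looking_for):
    """Return the strings from looking_for that occur in text, sorted by first occurrence."""
    positions = []
    for thing in looking_for:
        index = text.find(thing)  # str.find returns -1 when the string is absent
        if index != -1:
            positions.append((index, thing))
    return [thing for index, thing in sorted(positions)]


# Made-up input where CCC appears before AAA
text = "<font>CCC</font><font>AAA</font>"
looking_for = ['AAA', 'BBB', 'CCC']

membership_order = [thing for thing in looking_for if thing in text]
document_order = found_in_document_order(text, looking_for)

print(membership_order)  # ['AAA', 'CCC']
print(document_order)    # ['CCC', 'AAA']
```

Either way, both versions need the target strings to be known in advance, so they check for presence rather than extract unknown values.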

zellio · Accepted Answer · 2010-03-17 17:51:17Z

0

In a case like this, your options are: attempt a regex for it (which will be really hard), use a prewritten library, or do it yourself with f = open(), f.read(), and your own parser.
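As a rough idea of what the do-it-yourself option could look like, here is a minimal sketch that scans for a fixed opening and closing marker with str.find. It assumes the markers appear exactly as in the question's sample; extract_between is a hypothetical helper name, and this is plain string scanning, not real HTML parsing.

```python
def extract_between(text, start_marker, end_marker):
    """Collect every substring found between start_marker and end_marker."""
    results = []
    position = 0
    while True:
        start = text.find(start_marker, position)
        if start == -1:
            break  # no more opening markers
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break  # opening marker without a closing one; stop scanning
        results.append(text[start:end])
        position = end + len(end_marker)
    return results


def foo(path):
    # Assumption: the file uses exactly this quoting and lowercase tag names.
    with open(path) as f:
        text = f.read()
    return extract_between(text, "<font class='textfont'>", "</font>")


sample = "x<font class='textfont'>AAA</font>y<font class='textfont'>BBB</font>z"
print(extract_between(sample, "<font class='textfont'>", "</font>"))  # ['AAA', 'BBB']
```

This stays short only because the markers are fixed. Any variation in quoting, attribute order, or tag case would need extra handling.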


Community · Accepted Answer · 2017-05-23 11:48:22Z

If you just want to get the data from inside all of the tags in the HTML document, while dropping all the tags themselves, you could do something like this:

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class DataOnlyParser(HTMLParser):
    def parse(self, text):
        # Reset the result list, feed the whole document, then flush the parser
        self.result = []
        self.feed(text)
        self.close()
        return self.result

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs
        data = data.strip()
        if data:
            self.result.append(data)

p = DataOnlyParser()

data = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
"""

print(p.parse(data))
# ['AAA', 'BBB', 'CCC', 'DDD']

If your selection criteria are more complex, though, or if the input is malformed, you'd probably be better off with a library like lxml.

You do NOT want to use regular expressions to "parse" html. See here.
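For a slightly more selective criterion that still avoids third-party libraries, one option is to track whether the parser is currently inside a matching tag. The sketch below keeps only text inside font elements whose class is exactly textfont. The names TextFontParser and extract_textfont are hypothetical, the sample rows are made up, and it assumes the class attribute holds a single class name.

```python
from html.parser import HTMLParser


class TextFontParser(HTMLParser):
    """Collect text found inside <font class="textfont"> elements only."""

    def __init__(self):
        super().__init__()
        self.result = []
        self._depth = 0  # greater than 0 while inside a matching <font> element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips attribute quotes.
        if tag == "font" and (self._depth or dict(attrs).get("class") == "textfont"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        data = data.strip()
        if self._depth and data:
            self.result.append(data)


def extract_textfont(html):
    parser = TextFontParser()
    parser.feed(html)
    parser.close()
    return parser.result


html = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<TD align="left" class=texttd><font class='other'>skip me</font></TD>
<TD align="left" class=texttd><FONT class="textfont">BBB</FONT></TD>
"""
print(extract_textfont(html))  # ['AAA', 'BBB']
```

Because the parser normalizes tag case and attribute quoting, the uppercase and double-quoted row is still matched, while the row with a different class is skipped.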
