For my Information Retrieval class I have to build an index of terms from a group of files. A valid term contains at least one alphabetic character, so to test for that I wrote a simple function and call it from an if statement. So far I have:

ALPHA = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 
'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

def content_test(term):
    for a in ALPHA:
        if a in term:
            return True
    return False

class FileRead():

    def __init__(self, filename):
        f = open(filename, 'r')
        content = f.read()
        self.terms = content.split()

    def clean(self):
        for term in self.terms:
            if content_test(term) is False:
                try:
                    terms.remove(term)
                except:
                    pass

Now this all works fine (I think...), but I've been trying to learn more high-level Python, and I can't help thinking that there is a more Pythonic way of checking term validity (maybe using map() or a lambda function?).

Am I correct, or am I just overthinking it?

1
  • Small cleanup that you might find handy: import string; ALPHA = string.lowercase (string.ascii_lowercase in Python 3). Commented Feb 16, 2012 at 18:03

4 Answers

2

You can start by simplifying content_test():

def content_test(term):
    return any(c.isalpha() for c in term)

In fact, that's simple enough that you don't really need a separate function for it anymore.

What I'd do in this case is write a generator that yields only valid terms from the file. Then just convert that to a list using the list() constructor. This way you can read just a line at a time, which will save you a good bit of memory if the files are large.

def read_valid_terms(filename):
    with open(filename) as f:
        for line in f:
            for term in line.split():
                if any(c.isalpha() for c in term):
                    yield term

terms = list(read_valid_terms("terms.txt"))

Or if you are just going to iterate over the terms anyway, and only once, then just do that directly rather than making a list:

for term in read_valid_terms("terms.txt"):
    print term,
print
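
For readers on Python 3, where print is a function, the same generator can be exercised as in the sketch below. The temporary file and its sample contents are made up purely for the demonstration.

```python
import os
import tempfile


def read_valid_terms(filename):
    """Yield each whitespace-separated term that contains at least one letter."""
    with open(filename) as f:
        for line in f:
            for term in line.split():
                if any(c.isalpha() for c in term):
                    yield term


# Hypothetical sample data, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello 123 abc22\n--- x9 42\n")
    path = tmp.name

terms = list(read_valid_terms(path))
os.remove(path)

print(*terms)  # prints: hello abc22 x9
```

Terms made up only of digits or punctuation are dropped, while mixed terms such as abc22 are kept.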

2 Comments

Instead of using the various quirks of the print statement, I'd rather suggest the print function in the last two lines.
Eh, just a quickie demo, what's inside the for loop is immaterial to the example.
1

In Python, string objects already contain a method that does that for you:

>>> "abc".isalpha()
True
>>> "abc22".isalpha()
False

4 Comments

We need islower here too =D
The problem is that valid terms can contain non-alpha chars (just not exclusively).
Ah, that wasn't clear. So you mean that it just needs to contain a single alphabetic character someplace in the string?
Yes, exactly (comment length)
1

While you could use a regular expression, a pythonic way would be to use any:

import string
def content_test(term):
    return any((c in string.ascii_lowercase) for c in term)

If you also want to allow upper-case and locale-dependent characters, you can use str.isalpha.
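
To make the difference concrete, here is a small sketch comparing the two checks. The helper names and sample terms are illustrative only, and the result for the accented term assumes Python 3, where str is Unicode.

```python
import string


def has_ascii_lowercase(term):
    """True if term contains at least one of the letters a-z."""
    return any(c in string.ascii_lowercase for c in term)


def has_letter(term):
    """True if term contains at least one character that str.isalpha() accepts."""
    return any(c.isalpha() for c in term)


samples = ["abc22", "ABC22", "ÉCOLE", "1234"]

# Map each sample to (ascii-lowercase check, isalpha check).
results = {t: (has_ascii_lowercase(t), has_letter(t)) for t in samples}

for term, (ascii_ok, alpha_ok) in results.items():
    print(term, ascii_ok, alpha_ok)
# abc22 True True
# ABC22 False True
# ÉCOLE False True
# 1234 False False
```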

A couple of additional notes:

  • In Python 2, FileRead should inherit from object to make sure it's a new-style class (in Python 3 every class already is one).
  • Instead of writing if content_test(term) is False:, you can simply write if not content_test(term):.
  • clean can be written a lot, ahem, cleaner, by using filter:

def clean(self):
    self.terms = filter(content_test, self.terms)

  • You're not closing the file f, and may therefore leak the handle. Use the with statement to automatically close it, like this:

with open(filename, 'r') as f:
    content = f.read()
    self.terms = content.split()
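
Putting these notes together, the class might look like the sketch below. It assumes Python 3, where filter() returns a lazy iterator rather than a list, so a list comprehension is used to keep self.terms a list. The sample file contents are invented for the demo.

```python
import os
import tempfile


def content_test(term):
    """True if term contains at least one alphabetic character."""
    return any(c.isalpha() for c in term)


class FileRead(object):
    def __init__(self, filename):
        # The with statement closes the file even if read() raises.
        with open(filename, "r") as f:
            self.terms = f.read().split()

    def clean(self):
        # Build a new list rather than removing items while iterating.
        self.terms = [term for term in self.terms if content_test(term)]


# Hypothetical sample data, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("foo 42 b4r !!!\n")
    path = tmp.name

reader = FileRead(path)
os.remove(path)
reader.clean()
print(reader.terms)  # prints: ['foo', 'b4r']
```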

1 Comment

I'm selecting kindall's answer as the correct one, but yours still helped a lot. I appreciate it.
0

Using regular expressions:

import re

# Match each run of non-whitespace characters that contains at least one ASCII letter.
# The raw string keeps \S from being treated as a string escape.
terms = re.findall(r'\S*[a-zA-Z]\S*', content)
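
As a quick check of what the pattern picks up, here is a sketch run against a made-up content string. Note that [a-zA-Z] matches ASCII letters only, unlike str.isalpha.

```python
import re

# Hypothetical file contents, used only for this demo.
content = "hello 123 abc22 --- x9 42"

# Each match is a whole whitespace-delimited token containing an ASCII letter.
terms = re.findall(r"\S*[a-zA-Z]\S*", content)

print(terms)  # prints: ['hello', 'abc22', 'x9']
```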
