Hi
I need filter out all rows that don't contain symbols from huge "necessary" list, example code:
def any_it(iterable):
for element in iterable:
if element: return True
return False
regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members
f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()
## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg
filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))
I have python 2.4, so I can't use built-in any().
I wait a long time for this filtering, but is there some way to optimize it? For example row 1 and 4 contains "RED.." pattern, if we found that "RED.." pattern is ok, can we skip search in 10000-members list for row 4 the same pattern??
Is there some another way to optimize filtering?
Thank you.
...edited...
UPD: See real example data in comments to this post. I'm also interested in sorting by "fruits" the result. Thanks!
...end edited...