Smart filter with python

Question

Hi
I need to filter out all rows that don't contain symbols from a huge "necessary" list. Example code:

import re

def any_it(iterable):
    for element in iterable:
        if element:
            return True
    return False

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members
f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg

filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))

I have Python 2.4, so I can't use the built-in any().
This filtering takes a long time. Is there some way to optimize it? For example, rows 1 and 4 both contain the "RED.." pattern; once we have found that the "RED.." pattern is OK, can we skip searching the 10,000-member list for the same pattern in row 4?
Is there another way to optimize the filtering?
Thank you.
...edited...
UPD: See the real example data in the comments to this post. I'm also interested in sorting the result by "fruit". Thanks!
...end edited...

@DominiCane: Can you provide appropriately-sized representative data sets? There may be optimization paths that we can't anticipate because we're not familiar with your data.
– MattH, Commented Dec 9, 2010 at 13:40
What exactly can I provide? This example is close enough to the real situation. This filtering is part of a chain of generators that modifies rows in the file. I can't think of a closer example... Tell me what is missing here.
– DominiCane, Commented Dec 9, 2010 at 15:52
Quantity. Real filter items. There are fewer than 200 different words for colour I could find in the English language.
– MattH, Commented Dec 9, 2010 at 15:56
The files have about one to five million rows. Of course not with colors, but with exchange market symbols like "QZF10", "ZT F1:H2" and other lovely strings. A real example row is: msgType=QuoteMsg conType=call exch=206 sym=OZFH1 Strike=003 12000 fast=normal Quote=026 480 TckSiz=25 SalesCond=Ask BateModifier=Explicit
– DominiCane, Commented Dec 9, 2010 at 17:08
How about a sample of 100 filter items, 1000 lines of data and the correctly filtered output on pastebin or something?
– MattH, Commented Dec 10, 2010 at 12:13

Zach Hirsch · Accepted Answer · 2010-12-09 13:14:34Z

5

If you organized the necessary list as a trie, then you could look in that trie to check if the fruit starts with a valid prefix. That should be faster than comparing the fruit against every prefix.

For example (only mildly tested):

import bisect
import re

class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html                
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        if not value:
            # ran out of characters before reaching a complete prefix
            return False
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']

trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    found = regexp.findall(line)
    # skip lines that have no "fruit=..." entry at all
    if found and trie.contains_prefix_of(found[0]):
        filtered.append(line)

This changes the cost per line from O(N * k), where N is the number of elements in necessary and k is the length of fruit, to roughly O(k). It does take more memory, though, which might be a worthwhile trade-off in your case.

answered Dec 9, 2010 at 13:14

Zach Hirsch


7 Comments

Jochen Ritzel Over a year ago

+1 I'm sure this will give by far the biggest speedup. In the original code, if a line did not match, the script had to go through every word in necessary. You can find some tested implementations of this too; it's called a patricia or radix tree.

Shawn Chin Over a year ago

+1 import bisect is new to me! Definitely the fastest approach suggested so far.

Zach Hirsch Over a year ago

@Shawn Yep, and as soon as I submitted this I realized that I could've just used a dict instead of a sorted list, which might be faster due to O(1) lookup instead of O(log(m)) at each Node. But since you commented I'll leave it as an example of how to use bisect :-)

DominiCane Over a year ago

@Zach Hirsch, trying your solution now.. What do you mean by using dict and not sorted list?? Can you explain, please?
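A minimal sketch of what the dict variant mentioned above might look like, assuming the same add / contains_prefix_of interface as the Node class in the answer. The class name DictNode and the sample words are made up for illustration, and this is not Zach's own code. Each node keeps its children in a dict keyed by character, so finding the child for a character is a single hash lookup instead of a bisect search over a sorted list:

```python
class DictNode(object):
    """Trie node whose children live in a dict instead of two sorted lists."""

    def __init__(self):
        self.children = {}   # maps a single character to a child DictNode
        self.exists = False  # True if a complete prefix ends at this node

    def add(self, value):
        node = self
        for ch in value:
            if ch not in node.children:
                node.children[ch] = DictNode()
            node = node.children[ch]
        node.exists = True

    def contains_prefix_of(self, value):
        node = self
        for ch in value:
            if node.exists:
                return True
            node = node.children.get(ch)
            if node is None:
                return False
        # all characters consumed: match only if a prefix ends exactly here
        return node.exists


trie = DictNode()
for word in ['RED', 'GREEN', 'LIGHTRED']:
    trie.add(word)

# keeps the values that start with one of the stored prefixes
matches = [f for f in ['REDSOMETHING', 'RE', 'GREEN', 'LIGHTBLUE']
           if trie.contains_prefix_of(f)]
```

The loops replace the recursion of the original, which also avoids slicing value[1:] at every level.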

DominiCane Over a year ago

@Zach Hirsch, another off-topic question: in the future I need to sort this file's lines by "fruit". Could bisect help do that faster than sorted() with key=regexp()?


Community · Accepted Answer · 2017-05-23 12:04:14Z

1

I'm convinced Zach's answer is on the right track. Out of curiosity, I've implemented another version (incorporating Zach's comments about using a dict instead of bisect) and folded it into a solution that matches your example.

#!/usr/bin/env python
import re
from trieMatch import PrefixMatch # https://gist.github.com/736416

pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ]) # huge list of 10 000 members
# if the list is static, it might be worth pickling "pm" to avoid rebuilding it each time

f = open("huge_file.txt", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))

For brevity, the implementation of PrefixMatch is published separately at https://gist.github.com/736416.

If your list of necessary prefixes is static or changes infrequently, you can speed up subsequent runs by pickling and reusing the PrefixMatch object instead of rebuilding it each time.
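A rough sketch of that caching idea. It does not use the real PrefixMatch class from the gist; a plain nested-dict trie stands in for it, and build_trie, load_or_build and the cache file name are hypothetical names chosen for this example. It is written for a current Python (the with statement is not available in 2.4):

```python
import os
import pickle
import tempfile

def build_trie(prefixes):
    """Build a nested-dict trie; the key '' marks the end of a prefix."""
    root = {}
    for prefix in prefixes:
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[''] = True
    return root

def load_or_build(cache_path, prefixes):
    """Reuse a pickled trie if one exists, otherwise build and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as fh:
            return pickle.load(fh)
    trie = build_trie(prefixes)
    with open(cache_path, 'wb') as fh:
        pickle.dump(trie, fh, pickle.HIGHEST_PROTOCOL)
    return trie

cache_path = os.path.join(tempfile.mkdtemp(), 'prefixes.pickle')
first = load_or_build(cache_path, ['RED', 'GREEN'])  # builds and saves
second = load_or_build(cache_path, [])               # loaded from the cache file
```

Note that this sketch never invalidates the cache: if the prefix list changes, the file has to be deleted (or its name derived from the list) so the structure is rebuilt.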

update (on sorted results)

According to the changelog for Python 2.4:

key should be a single-parameter function that takes a list element and returns a comparison key for the element. The list is then sorted using the comparison keys.

also, in the source code, line 1792:

/* Special wrapper to support stable sorting using the decorate-sort-undecorate
   pattern.  Holds a key which is used for comparisons and the original record
   which is returned during the undecorate phase.  By exposing only the key
   .... */

This means that your regex pattern is only evaluated once for each entry (not once for each compare), hence it should not be too expensive to do:

sorted_generator = sorted(filtered, key=lambda line: regexp.match(line).group(1))
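To see the "once per entry" behaviour for yourself, a small check along these lines may help. The counter and the fruit_key helper are only there for the demonstration, and the sample lines are the ones from the question:

```python
import re

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
calls = [0]  # mutable counter so the key function can update it

def fruit_key(line):
    """Extract the fruit value and count how often we are called."""
    calls[0] += 1
    return regexp.match(line).group(1)

lines = [
    '1 djhds fruit=REDSOMETHING sdkjld',
    '2 sdhfkjk fruit=GREENORANGE lkjfldk',
    '3 dskjldsj fruit=YELLOWDOG sldkfjsdl',
    '4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg',
]
by_fruit = sorted(lines, key=fruit_key)
# calls[0] now equals len(lines): one regex match per line, not one per comparison
```

Note that key has to be a callable taking one line; passing the result of a single regexp.match(...) call would not work.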

edited May 23, 2017 at 12:04

CommunityBot


answered Dec 10, 2010 at 16:44

Shawn Chin


4 Comments

DominiCane Over a year ago

Thanks! Filtering works perfectly! But sorting, surprisingly, is slower than my old "sorted(generator, key=lambda x: regexp.findall(x)[0])". Currently investigating why..

Shawn Chin Over a year ago

Does it help if you use sort(lambda x,y:cmp(x[0],y[0])) on the mapped list instead of a straight up sort()?

Shawn Chin Over a year ago

According to this answer (stackoverflow.com/questions/463032//464815#464815) key=.. for sort/sorted would already perform map-sort-unmap, so no need to do it yourself. Some of the benchmarks I've run seem to suggest the same. Will remove sorting bit from my answer.

Shawn Chin Over a year ago

Found proof that as of Python 2.4, sorted(key=...) already does decorate-sort-undecorate. Answer updated.

knifenomad · Accepted Answer · 2010-12-09 12:47:06Z

1

I personally like your code as is, since you treat "fruit=COLOR" as a pattern, which the others do not. I think you are looking for something like memoization, which lets you skip the test for an already-solved problem, but I guess that does not apply here.

from itertools import ifilter

def any_it(iterable):
    for element in iterable:
        if element:
            return True
    return False

necessary = ['YELLOW', 'GREEN', 'RED', ...]

predicate = lambda line: any_it("fruit=" + color in line for color in necessary)

filtered = ifilter(predicate, open("testest"))
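For what it's worth, the memoization idea mentioned above could be sketched roughly like this. It only pays off when exactly the same value shows up on many lines (REDSOMETHING and REDSOMETHINGELSE are different values, so they would not share a cache entry), which may or may not hold for the real data. The names seen and is_wanted are made up, and any() assumes a newer Python than 2.4 (any_it would do the same job):

```python
import re

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED']
seen = {}  # fruit value -> True/False, filled in as lines are processed

def is_wanted(line):
    found = regexp.search(line)
    if found is None:
        return False
    fruit = found.group(1)
    if fruit not in seen:
        # the slow scan happens only the first time a value appears
        seen[fruit] = any(fruit.startswith(p) for p in necessary)
    return seen[fruit]

lines = [
    '1 djhds fruit=REDSOMETHING sdkjld',
    '2 sdhfkjk fruit=BLUEBERRY lkjfldk',
    '3 gfhfg fruit=REDSOMETHING fgdgdfg',
]
filtered = [line for line in lines if is_wanted(line)]
```

The cache grows with the number of distinct values, so for millions of rows with mostly unique values it would cost memory without saving much time.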

answered Dec 9, 2010 at 12:47

knifenomad


1 Comment

DominiCane Over a year ago

Yes, string operations are much faster.

Shawn Chin · Accepted Answer · 2010-12-09 14:13:27Z

1

Tested (but unbenchmarked) code:

import re
import fileinput

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ]

filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        key = regexp.match(line).group(1)
    except AttributeError:
        continue # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break

# "filtered" now holds your results
print "".join(filtered)

Diff to code in question:

1. We do not first load the whole file into memory (as is done when you use file.readlines()). Instead, we process each line as the file is read in. I use the fileinput module here for brevity, but one can also use line = file.readline() and a while line: loop.
2. We stop iterating through the necessary list once a match is found.
3. We modified the regex pattern and use re.match instead of re.findall. That's assuming that each line would only contain one "fruit=..." entry.
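The readline() alternative mentioned in the first point could look roughly like this. It is only a sketch: filter_stream is a made-up helper name, and io.StringIO stands in for a real open file so the example is self-contained:

```python
import io
import re

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED']

def filter_stream(stream):
    """Read one line at a time and keep lines whose fruit has a known prefix."""
    filtered = []
    line = stream.readline()
    while line:                      # readline() returns '' at end of file
        found = regexp.match(line)
        if found:
            key = found.group(1)
            for p in necessary:
                if key.startswith(p):
                    filtered.append(line)
                    break            # stop at the first matching prefix
        line = stream.readline()
    return filtered

sample = io.StringIO(
    u'1 djhds fruit=REDSOMETHING sdkjld\n'
    u'2 sdhfkjk fruit=BLUEBERRY lkjfldk\n'
    u'3 no fruit on this line\n'
)
result = filter_stream(sample)
```

With a real file you would pass the object returned by open(...) instead of the StringIO.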

update

If the format of the input file is consistent, you can squeeze out a little more performance by getting rid of regex altogether.

try:
    # with line = "2 asdasd fruit=SOMETHING asdasd...."
    key = line.split(" ", 3)[2].split("=")[1]
except IndexError:
    continue # no match

edited Dec 9, 2010 at 14:13

answered Dec 9, 2010 at 12:12

Shawn Chin


1 Comment

DominiCane Over a year ago

Thanks, I got it. I didn't mention that this generator is only one in the chain of generators, so reading the file is not the bottleneck here. About 2 - function any_it() does the same. And 3 is useful for me here, thanks!

dugres · Accepted Answer · 2010-12-09 14:20:40Z

1

filtered=[]
for line in open('huge_file'):
    found=regexp.findall(line)
    if found:
        fruit=found[0]
        for x in necessary:
            if fruit.startswith(x):
                filtered.append(line)
                break

or maybe :

necessary=['fruit=%s'%x for x in necessary]
filtered=[]
for line in open('huge_file'):
    for x in necessary:
        if x in line:
            filtered.append(line)
            break

edited Dec 9, 2010 at 14:20

answered Dec 9, 2010 at 12:29

dugres


3 Comments

DominiCane Over a year ago

I think any_it() function does same thing, no?

dugres Over a year ago

@DominiCane : any_it is fine; it's simpler without it, though. In your version regexp.findall(line)[0] is called len(necessary) times for each line.

DominiCane Over a year ago

I think regexp.findall(line)[0] is called until we get True from any_it(), not len(necessary) times. But you're right about running the regexp at every step: it's slow and unnecessary, thanks.

Thomas K · Accepted Answer · 2010-12-09 14:22:01Z

1

I'd make a simple list of ['fruit=RED','fruit=GREEN'... etc. with ['fruit='+n for n in necessary], then use in rather than a regex to test them. I don't think there's any way to do it really quickly, though.

filtered = (line for line in f if any(a in line for a in necessary_simple))

(The any() function is doing the same thing as your any_it() function)

Oh, and get rid of file.readlines(), just iterate over the file.

edited Dec 9, 2010 at 14:22

answered Dec 9, 2010 at 12:12

Thomas K


2 Comments

DominiCane Over a year ago

Thanks, nice and simple. The string "in" operator really is faster. But we should use the any_it() function anyway, so the final version is: filtered = (line for line in f if any_it(a in line for a in necessary_simple)). Otherwise we get "if [true, true, false, false, true,...]", which is always True. And in your version the variable "a" is referenced before assignment. Do you have a solution without "any()"?

Thomas K Over a year ago

@DominiCane: Well spotted, I should have tested it. I think any() is the neatest way to do it: your any_it function shouldn't be much slower.

Lennart Regebro · Accepted Answer · 2010-12-09 12:10:21Z

0

Untested code:

filtered = []
for line in lines:
    value = line.split('=', 1)[1].split(' ',1)[0]
    if value in necessary:
        filtered.append(line)

That should be faster than pattern matching 10 000 patterns onto a line. Possibly there are even faster ways. :)

answered Dec 9, 2010 at 12:10

Lennart Regebro


4 Comments

Mattias Nilsson Over a year ago

If the necessary list is big, wouldn't a set be a better choice for it?

MattH Over a year ago

My understanding was that the value only has to start with any item in necessary, not be equal to it.

DominiCane Over a year ago

the problem is that value contains part of pattern. "REDSOMETHING" is not in ["RED", "GREEN"], and I can't extract "RED" from "REDSOMETHING", I don't know the length
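One way to combine the set suggestion with the unknown-length problem, offered only as a sketch: test every leading slice of the value against the set, stopping at the length of the longest known prefix. The function name starts_with_necessary is made up for this example, and nothing here has been benchmarked against the real data:

```python
necessary = set(['RED', 'GREEN', 'YELLOW'])
longest = max(len(p) for p in necessary)  # no slice longer than this can match

def starts_with_necessary(value):
    # try value[:1], value[:2], ... up to the longest known prefix
    for end in range(1, min(len(value), longest) + 1):
        if value[:end] in necessary:
            return True
    return False

checks = [starts_with_necessary(v)
          for v in ['REDSOMETHING', 'GREENORANGE', 'RE', 'BLUEDOG']]
```

Each line then costs at most `longest` set lookups, regardless of how many items are in necessary.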

Lennart Regebro Over a year ago

Ah, I see. Yeah, using "startswith" is not necessarily faster than pattern matching, so re might be the best solution here. And yeah, that will be slow.

Symbol · Accepted Answer · 2010-12-09 12:12:17Z

0

It shouldn't take too long to iterate through 100,000 strings, but you also have a list of 10,000 strings, which means up to 10,000 * 100,000 = 1,000,000,000 string comparisons, so I don't know what you expected... As for your question: if you only need to find at least one word from the list, you can skip the rest of the list as soon as you hit a match, which should speed up the search. (If you need exactly one match, you have to iterate through the whole list.)

answered Dec 9, 2010 at 12:12

Symbol

