Python: String counting memory error

Question

Based on the suggestions I received in this forum, I am using the following code (example) to count strings.

phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey', 'turnpike. when i took exit 39 there was no', 'lake. i drove my car on muddy roads which turned my red', 'car into brown. driving on newjersey turnpike can be confusing.']
text = " ".join(lines)
dict = {phrase: text.count(phrase) for phrase in phrase_words}

The desired output and the output of the example code is:

{'newjersey turnpike': 2, 'red car': 2, 'no lake': 1}

This code worked great on a text file which was less than 300MB. I used a text file of size 500MB + and received the following memory error:

    y=' '.join(lines)
MemoryError

How do I overcome this? Thanks for your help!

You are trying to load too much into memory at once. You need to take a more iterated approach. What are you going to do with y? — cheeken
– cheeken, Commented Sep 19, 2011 at 18:16
Count the strings in 'y' which are present in phrase_words array! — Zenvega
– Zenvega, Commented Sep 19, 2011 at 18:20
If you want to inspect each line inside of lines, why are you joining them all into y? Why not just look at them individually? — cheeken
– cheeken, Commented Sep 19, 2011 at 18:25
docs.python.org/library/exceptions.html#exceptions.MemoryError . Is it that hard? If you can't code and have no ambitions to learn it, let - it - be. — robert
– robert, Commented Sep 19, 2011 at 18:25
Very interesting, nontrivial task. @cheeken: that's the whole point. Some patterns may show up only after joining the lines, as "newjersey turnpike" does in his example. — Johannes Charra
– Johannes Charra, Commented Sep 19, 2011 at 18:28

Steven Rumbalski · Accepted Answer · 2011-09-19 18:42:29Z

This algorithm needs only two lines in memory at a time. It assumes that no phrase will span three lines:

from itertools import tee, izip
from collections import defaultdict

def pairwise(iterable): # recipe from itertools docs
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)
d = defaultdict(int)
phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey',
         'turnpike. when i took exit 39 there was no',
         'lake. i drove my car on muddy roads which turned my red',
         'car into brown. driving on newjersey turnpike can be confusing.']

for line1, line2 in pairwise(lines):
    both_lines= ' '.join((line1, line2))
    for phrase in phrase_words:
        # counts phrases in first line and those that span to the next
        d[phrase] += both_lines.count(phrase) - line2.count(phrase)
for phrase in phrase_words:
    d[phrase] += line2.count(phrase) # otherwise last line is not searched

Dave · Accepted Answer · 2011-09-19 19:38:21Z

3

You need to not try to do everything at once. Instead of loading a huge file into memory, and then parsing it, you should load the file a few (hundred) lines at a time, and try to find your strings within those lines.

As long as your chunks overlap at least max(len(phrase) for phrase in phrase_words) characters, you won't miss any hits.

So you could do something like:

text = ''
occurs = {}
overlap_size = max(len(phrase) for phrase in phrase_words)
for phrase in phrase_words:
    occurs[phrase] = 0

while lines:

    text += ' '.join(lines[:1000])
    lines = lines[100:]
    for phrase in phrase_words:
        # this makes sure we don't double count, and also gets all matches (probably)
        occurs[phrase] += text[overlap_size - len(phrase):].count(phrase)
    text = text[-1 * overlap_size:]

edited Sep 19, 2011 at 19:38

answered Sep 19, 2011 at 18:31

Dave

11.9k5 gold badges37 silver badges46 bronze badges

5 Comments

Johannes Charra Over a year ago

Correct. Still, counting the phrases is complicated. If you count before and after moving your "window" forward, how do you know which phrases you already counted? At the chunk edges, this gets tricky.

Dave Over a year ago

good point. You could make sure at least part of your phrase is in the new bit. I'll add some code for that.

Steven Rumbalski Over a year ago

Two problems I see: If a phrase occurs twice on one line it's only counted once. If the file starts with a phrase that is shorter than the overlap size it is not counted.

Steven Rumbalski Over a year ago

Should have said that if a phrase occurs twice in one chunk of lines it is only counted once.

Dave Over a year ago

True ok... what if we do something like this instead... (I've changed it to use count on a slice of the text)

Collectives™ on Stack Overflow

Python: String counting memory error

2 Answers 2

Comments

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related