4

Based on the suggestions I received in this forum, I am using the following code (example) to count strings.

phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey', 'turnpike. when i took exit 39 there was no', 'lake. i drove my car on muddy roads which turned my red', 'car into brown. driving on newjersey turnpike can be confusing.']
text = " ".join(lines)
dict = {phrase: text.count(phrase) for phrase in phrase_words}

The desired output and the output of the example code is:

{'newjersey turnpike': 2, 'red car': 2, 'no lake': 1}

This code worked great on a text file which was less than 300MB. I used a text file of size 500MB + and received the following memory error:

    y=' '.join(lines)
MemoryError

How do I overcome this? Thanks for your help!

5
  • 2
    You are trying to load too much into memory at once. You need to take a more iterated approach. What are you going to do with y? Commented Sep 19, 2011 at 18:16
  • Count the strings in 'y' which are present in phrase_words array! Commented Sep 19, 2011 at 18:20
  • 1
    If you want to inspect each line inside of lines, why are you joining them all into y? Why not just look at them individually? Commented Sep 19, 2011 at 18:25
  • docs.python.org/library/exceptions.html#exceptions.MemoryError . Is it that hard? If you can't code and have no ambitions to learn it, let - it - be. Commented Sep 19, 2011 at 18:25
  • Very interesting, nontrivial task. @cheeken: that's the whole point. Some patterns may show up only after joining the lines, as "newjersey turnpike" does in his example. Commented Sep 19, 2011 at 18:28

2 Answers 2

5

This algorithm needs only two lines in memory at a time. It assumes that no phrase will span three lines:

from itertools import tee, izip
from collections import defaultdict

def pairwise(iterable): # recipe from itertools docs
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)
d = defaultdict(int)
phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey',
         'turnpike. when i took exit 39 there was no',
         'lake. i drove my car on muddy roads which turned my red',
         'car into brown. driving on newjersey turnpike can be confusing.']

for line1, line2 in pairwise(lines):
    both_lines= ' '.join((line1, line2))
    for phrase in phrase_words:
        # counts phrases in first line and those that span to the next
        d[phrase] += both_lines.count(phrase) - line2.count(phrase)
for phrase in phrase_words:
    d[phrase] += line2.count(phrase) # otherwise last line is not searched
Sign up to request clarification or add additional context in comments.

Comments

3

You need to not try to do everything at once. Instead of loading a huge file into memory, and then parsing it, you should load the file a few (hundred) lines at a time, and try to find your strings within those lines.

As long as your chunks overlap at least max(len(phrase) for phrase in phrase_words) characters, you won't miss any hits.

So you could do something like:

text = ''
occurs = {}
overlap_size = max(len(phrase) for phrase in phrase_words)
for phrase in phrase_words:
    occurs[phrase] = 0

while lines:

    text += ' '.join(lines[:1000])
    lines = lines[100:]
    for phrase in phrase_words:
        # this makes sure we don't double count, and also gets all matches (probably)
        occurs[phrase] += text[overlap_size - len(phrase):].count(phrase)
    text = text[-1 * overlap_size:]

5 Comments

Correct. Still, counting the phrases is complicated. If you count before and after moving your "window" forward, how do you know which phrases you already counted? At the chunk edges, this gets tricky.
good point. You could make sure at least part of your phrase is in the new bit. I'll add some code for that.
Two problems I see: If a phrase occurs twice on one line it's only counted once. If the file starts with a phrase that is shorter than the overlap size it is not counted.
Should have said that if a phrase occurs twice in one chunk of lines it is only counted once.
True ok... what if we do something like this instead... (I've changed it to use count on a slice of the text)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.