I'm trying to count the occurrences of strings in text files. Each file is about 200MB and looks like this:
String1 30
String2 100
String3 23
String1 5
.....
I want to save the counts into a dict.
import os

count = {}
for filename in os.listdir(path):
    if filename.endswith("idx"):
        continue
    print filename
    f = open(os.path.join(path, filename))
    for line in f:
        (s, cnt) = line[:-1].split("\t")
        if s not in count:
            try:
                count[s] = 0
            except MemoryError:
                print(len(count))
                exit()
        count[s] += int(cnt)
    f.close()
print(len(count))
I got a MemoryError at count[s] = 0,
but my computer still has plenty of available memory.
How do I resolve this problem?
Thank you!
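For reference, here is a minimal sketch of the same tab-separated counting loop using collections.defaultdict, run on a small made-up sample instead of my real files (defaultdict was only added in Python 2.5, so this is written in modern style, not something that would run on 2.4 itself):

```python
from collections import defaultdict

# Hypothetical sample lines in the same "string<TAB>count" format.
lines = ["String1\t30", "String2\t100", "String3\t23", "String1\t5"]

count = defaultdict(int)  # missing keys start at 0, no explicit init needed
for line in lines:
    s, cnt = line.rstrip("\n").split("\t")
    count[s] += int(cnt)

print(dict(count))  # {'String1': 35, 'String2': 100, 'String3': 23}
```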
UPDATE:
I copied the actual code here.
My python version is 2.4.3, and the machine is running linux and has about 48G memory, but it only consumes less than 5G. the code stops at len(count)=44739243.
UPDATE2: The strings can be duplicated (they are not unique), so I want to add up all the counts for each string. The only operation I need afterwards is reading the count for each string. There are about 10M lines per file, and I have more than 30 files. I expect the total count to be less than 100 billion.
UPDATE3: the OS is Linux 2.6.18.
Comments: Aren't f.close() and the final print(len(count)) incorrectly indented here? Try printing len(count) and sys.getsizeof(count) after closing each file, just to get an idea of how big the dictionary gets.
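The suggested size check could look like the sketch below, on made-up data rather than my real files. Note that sys.getsizeof was only added in Python 2.6, so it is not available on 2.4; also, it reports only the dict container itself, not the keys and values it holds:

```python
import sys

# Build a small hypothetical dict to inspect.
count = {}
for i in range(1000):
    count["String%d" % i] = i

print(len(count))           # 1000
print(sys.getsizeof(count))  # size of the dict object itself, in bytes
```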