I'm trying to count the occurrences of strings in text files. Each file is about 200MB and looks like this:
String1 30
String2 100
String3 23
String1 5
.....
I want to save the counts into a dict.
import os

count = {}
for filename in os.listdir(path):
    if filename.endswith("idx"):
        continue
    print filename
    f = open(os.path.join(path, filename))
    for line in f:
        (s, cnt) = line[:-1].split("\t")
        if s not in count:
            try:
                count[s] = 0
            except MemoryError:
                print(len(count))
                exit()
        count[s] += int(cnt)
    f.close()
print(len(count))
I got a MemoryError at count[s] = 0,
but my computer still has plenty of available memory.
How do I resolve this problem?
Thank you!
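For reference, here is a minimal sketch of the same tab-separated counting loop using collections.defaultdict, run on a small made-up sample instead of my real files (defaultdict was only added in Python 2.5, so this is written in modern style, not something that would run on 2.4 itself):

```python
from collections import defaultdict

# Hypothetical sample lines in the same "string<TAB>count" format.
lines = ["String1\t30", "String2\t100", "String3\t23", "String1\t5"]

count = defaultdict(int)  # missing keys start at 0, no explicit init needed
for line in lines:
    s, cnt = line.rstrip("\n").split("\t")
    count[s] += int(cnt)

print(dict(count))  # {'String1': 35, 'String2': 100, 'String3': 23}
```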
UPDATE:
I copied the actual code here.
My python version is 2.4.3, and the machine is running linux and has about 48G memory, but it only consumes less than 5G. the code stops at len(count)=44739243.
UPDATE2: The strings can be duplicated (they are not unique), so I want to add up all the counts for each string. The only operation I need afterwards is reading the count for each string. There are about 10M lines per file, and I have more than 30 files. I expect the total count to be less than 100 billion.
UPDATE3: the OS is Linux 2.6.18.
Comments: Aren't f.close() and the final print(len(count)) incorrectly indented here? Try printing len(count) and sys.getsizeof(count) after closing each file, just to get an idea of how big the dictionary gets.
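The suggested size check could look like the sketch below, on made-up data rather than my real files. Note that sys.getsizeof was only added in Python 2.6, so it is not available on 2.4; also, it reports only the dict container itself, not the keys and values it holds:

```python
import sys

# Build a small hypothetical dict to inspect.
count = {}
for i in range(1000):
    count["String%d" % i] = i

print(len(count))           # 1000
print(sys.getsizeof(count))  # size of the dict object itself, in bytes
```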