To test the overhead of the data structure in your code, I wrote the following test program. It assumes your text file is megs megabytes of ASCII with relatively short lines. (I had to reduce megs from 450 to 150 after running out of physical memory.)
import sys
MB = 1024 * 1024
line = "the quick brown fox jumps over the lazy dog"
megs = 150
nlines = (megs * MB) / len(line)
d = {}
for i in xrange(nlines):
    d[i] = line.split(' ')
dict_size = sys.getsizeof(d)
list_size = sum(sys.getsizeof(v) for v in d.values())                  # the list objects
item_size = sum(sum(sys.getsizeof(s) for s in v) for v in d.values())  # the strings
print " dict:", dict_size / float(MB), "MB"
print "lists:", list_size / float(MB), "MB"
print "items:", item_size / float(MB), "MB"
print "total:", (dict_size + list_size + item_size) / float(MB), "MB"
with the result:
dict: 192.00 MB
lists: 251.16 MB
items: 669.77 MB
total: 1112.9 MB
Watching Activity Monitor, I saw the Python process exceed 2 gigabytes of memory usage, so there is also some memory not accounted for. Artifacts of the malloc implementation are one possibility.
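One way to account for more of that memory is to walk the object graph instead of measuring each container in isolation, since sys.getsizeof() reports only a container's own overhead, not the objects it references. Here is a rough sketch; the total_size helper is my own invention, it deduplicates shared objects with a seen set, and it undercounts any type it doesn't special-case:

```python
import sys

def total_size(obj, seen=None):
    """Roughly estimate the size of obj plus everything it references.

    A sketch, not a complete accounting: only dicts, lists, tuples,
    sets, and frozensets are traversed, and each object is counted
    once even if referenced from several places.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:          # already counted (shared reference)
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)    # the container's own overhead
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(x, seen) for x in obj)
    return size
```

Applied to the dictionary above, this would fold the three separate measurements into one number, though it still can't see allocator bookkeeping or fragmentation.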
I implemented the same program in C++:
#include <string>
#include <vector>
#include <unordered_map>
int main()
{
    int const MB = 1024 * 1024;
    std::string const line = "the quick brown fox jumps over the lazy dog";
    std::vector<std::string> const split = {
        "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"
    };
    int const megs = 150;
    int const nlines = (megs * MB) / line.size();
    std::unordered_map<int, std::vector<std::string>> d;
    for (int i = 0; i < nlines; ++i) {
        d[i] = split;
    }
}
Compiled with clang++ -O3, this used about 1GB of memory. C++ has no equivalent of sys.getsizeof(), so breaking down the memory usage would take more work, and I didn't do that work.
Twice the memory of the equivalent C++ is actually a pretty good result for Python, so I'm removing my pre-edit comments about the CPython implementation.
I think your main problem is storing each line as a list of many short strings. Could you store the lines as whole strings and split them as needed, rather than all at once?
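As a sketch of that idea; the LazyLines name and interface are my invention, not anything from your code:

```python
class LazyLines:
    """Store whole lines; split a line into words only when asked.

    Keeps one string object per line instead of one object per word,
    which avoids most of the per-object overhead measured above.
    """
    def __init__(self, text):
        self._lines = text.splitlines()   # whole strings, one per line

    def words(self, i):
        # The split list is built on demand and can be garbage-collected
        # as soon as the caller is done with it.
        return self._lines[i].split(' ')

lines = LazyLines("the quick brown fox\njumps over the lazy dog")
```

The trade-off is CPU for memory: each access re-splits the line. If the same few lines are accessed repeatedly, wrapping words() with a small bounded cache (e.g. functools.lru_cache) would recover most of the speed without storing every split list at once.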
What is the end goal of your program?