
I'm taking a large text file with tab-separated values and adding them to an array.

When I run my code on a 32 MB file, Python's memory consumption goes through the roof, using around 500 MB of RAM.

I need to be able to run this code for a 2 GB file, and possibly even larger files.

My current code is:

markers = []

def parseZeroIndex():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            markers.append(line.split('\t'))

parseZeroIndex()

Running this code against my 2 GB file is not possible as is. The files look like this:

per1    1029292 string1 euqye
per1    1029292 string2 euqys

My questions are:

What is using all this memory?

What is a more efficient way to do this memory wise?

  • First of all, you should always use the csv module. It will probably better handle how the file is read and cached. Commented Jul 2, 2016 at 21:20
  • Do you really need all the data stored together at once? Commented Jul 2, 2016 at 21:24
  • @hallizh Are your values all strings? Commented Jul 2, 2016 at 21:55
  • Is it possible to share the 32 MB file? There are probably much better ways to do what you want, but without knowing exactly what that is, it is going to be hard to suggest a significantly better approach. Commented Jul 2, 2016 at 22:02
  • @hallizh, what is the end goal? Commented Jul 2, 2016 at 22:11

1 Answer


"What is using all this memory?"

There's overhead for Python objects. See how many bytes some strings actually take:

Python 2:

>>> import sys
>>> map(sys.getsizeof, ('', 'a', u'ä'))
[21, 22, 28]

Python 3:

>>> import sys
>>> list(map(sys.getsizeof, ('', 'a', 'ä')))
[25, 26, 38]

"What is a more efficient way to do this memory wise?"

In the comments you said there are lots of duplicate values, so string interning (storing only one copy of each distinct string value) might help a lot. Try this:

Python 2:

            markers.append(map(intern, line.rstrip().split('\t')))

Python 3:

            markers.append(list(map(sys.intern, line.rstrip().split('\t'))))

Note I also used line.rstrip() to remove the trailing \n from the line.
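
Putting it together, a minimal sketch of your parse function with interning applied (Python 3; the file name 'chromosomedata' is taken from your code):

import sys

markers = []

def parseZeroIndex():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            # rstrip drops the trailing newline; sys.intern makes duplicate
            # field values share a single string object in memory
            markers.append(list(map(sys.intern, line.rstrip().split('\t'))))

parseZeroIndex()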


Experiment

I tried

>>> x = [str(i % 1000) for i in range(10**7)]

and

>>> import sys
>>> x = [sys.intern(str(i % 1000)) for i in range(10**7)]

in Python 3. The first one takes 355 MB (looking at the process in Windows Task Manager). The second one takes only 47 MB. Furthermore:

>>> sys.getsizeof(x)
40764032
>>> sum(map(sys.getsizeof, x[:1000]))
27890

So 40 MB is for the list referencing the strings (no surprise, as there are ten million references of four bytes each). And the strings themselves total only 27 KB.


Further improvements

As seen in the experiment, much of your RAM usage might not come from the strings but from your list objects: both the markers list itself and all those small list objects representing your rows, especially if you're running 64-bit Python, which I suspect you are.

To reduce that overhead, you could use tuples instead of lists for your rows, as they're more light-weight:

>>> sys.getsizeof(['a', 'b', 'c'])
48
>>> sys.getsizeof(('a', 'b', 'c'))
40

I estimate your 2 GB file has 80 million rows, so that would save 640 MB RAM. Perhaps more if you run 64-bit Python.
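
Combined with the interning above, the append line would then look like this (just a sketch, Python 3):

            markers.append(tuple(map(sys.intern, line.rstrip().split('\t'))))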

Another idea: If all your rows have the same number of values (I assume three), then you could ditch those 80 million row list objects and use a one-dimensional list of the 240 million string values instead. You'd just have to access it with markers[3*i+j] instead of markers[i][j]. And it could save a few GB RAM.
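
A rough sketch of that flat layout, assuming three values per row as above (the FIELDS constant is just for illustration, Python 3):

import sys

FIELDS = 3   # assumed number of values per row

markers = []  # flat list: value j of row i lives at markers[FIELDS*i + j]

def parseZeroIndex():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            # extend with the interned fields instead of appending a row list
            markers.extend(map(sys.intern, line.rstrip().split('\t')))

parseZeroIndex()

Accessing the second value of row i then becomes markers[FIELDS*i + 1], which corresponds to markers[i][1] in the original layout.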


7 Comments

This improved my solution significantly, thank you very much :) Still looking for ways to improve it though, 2 GB file uses around 8 GB of RAM at the moment (was approx 20 GB).
@hallizh Check out the experiment I added at the end and notice that my list takes a lot of memory, too. From your example data I estimate that you have 80 million rows, and sys.getsizeof(['a', 'b', 'c']) shows that a list with three string references takes 48 bytes. And 80 million times 48 bytes is about 3.6 GB.
@hallizh Oh and... please tell me what you get for sys.getsizeof(['a', 'b', 'c']). Given that you're able to use 8 GB, I suspect you might be running 64-bit Python?
@hallizh If your rows don't need to be lists, you could use tuples instead: markers.append(tuple(...)). Tuples are more light-weight, a three-tuple takes me 40 bytes instead of the list's 48 bytes.
@hallizh Another idea: If all your rows have the same number of values (I assume three), then you could ditch those ~80 million row list objects and use a one-dimensional list of ~240 million instead. You'd just have to access it with markers[3*i+j] instead of markers[i][j]. And it could save a few GB RAM.
