
I'm taking a large text file with tab-separated values and adding them to an array.

When I run my code on a 32 MB file, Python's memory consumption goes through the roof, using around 500 MB of RAM.

I need to be able to run this code for a 2 GB file, and possibly even larger files.

My current code is:

markers = []

def parseZeroIndex():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            markers.append(line.split('\t'))

parseZeroIndex()

Running this code against my 2 GB file is not possible as is. The files look like this:

per1    1029292 string1 euqye
per1    1029292 string2 euqys

My questions are:

What is using all this memory?

What is a more efficient way to do this memory wise?

  • First of all, you should always use the csv module. It will probably better handle how the file is read and cached. Commented Jul 2, 2016 at 21:20
  • Do you really need all the data stored together at once? Commented Jul 2, 2016 at 21:24
  • @hallizh Are your values all strings? Commented Jul 2, 2016 at 21:55
  • Is it possible to share the 32 MB file? There are probably much better ways to do what you want, but without knowing exactly what that is, it is going to be hard to suggest a significantly better approach. Commented Jul 2, 2016 at 22:02
  • @hallizh, what is the end goal? Commented Jul 2, 2016 at 22:11

1 Answer


"What is using all this memory?"

There's overhead for Python objects. See how many bytes some strings actually take:

Python 2:

>>> import sys
>>> map(sys.getsizeof, ('', 'a', u'ä'))
[21, 22, 28]

Python 3:

>>> import sys
>>> list(map(sys.getsizeof, ('', 'a', 'ä')))
[25, 26, 38]

"What is a more efficient way to do this memory wise?"

In the comments you said there are lots of duplicate values, so string interning (storing only one copy of each distinct string value) might help a lot. Try this:

Python 2:

            markers.append(map(intern, line.rstrip().split('\t')))

Python 3:

            markers.append(list(map(sys.intern, line.rstrip().split('\t'))))

Note I also used line.rstrip() to remove the trailing \n from the line.
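
Putting it together, a minimal sketch of your parse function with interning applied (Python 3; the file name 'chromosomedata' is taken from your code):

import sys

markers = []

def parseZeroIndex():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            # rstrip drops the trailing newline; sys.intern makes duplicate
            # field values share a single string object in memory
            markers.append(list(map(sys.intern, line.rstrip().split('\t'))))

parseZeroIndex()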


Experiment

I tried

>>> x = [str(i % 1000) for i in range(10**7)]

and

>>> import sys
>>> x = [sys.intern(str(i % 1000)) for i in range(10**7)]

in Python 3. The first one takes 355 MB (looking at the process in Windows Task Manager). The second one takes only 47 MB. Furthermore:

>>> sys.getsizeof(x)
40764032
>>> sum(map(sys.getsizeof, x[:1000]))
27890

So 40 MB is for the list referencing the strings (no surprise, as there are ten million references of four bytes each). And the strings themselves total only 27 KB.


Further improvements

As seen in the experiment, much of your RAM usage might not come from the strings but from your list objects: both the markers list itself and all those small list objects representing your rows, especially if you're running 64-bit Python, which I suspect you are.

To reduce that overhead, you could use tuples instead of lists for your rows, as they're more light-weight:

>>> sys.getsizeof(['a', 'b', 'c'])
48
>>> sys.getsizeof(('a', 'b', 'c'))
40

I estimate your 2 GB file has 80 million rows, so that would save 640 MB RAM. Perhaps more if you run 64-bit Python.
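
Combined with the interning above, the append line would then look like this (just a sketch, Python 3):

            markers.append(tuple(map(sys.intern, line.rstrip().split('\t'))))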

Another idea: If all your rows have the same number of values (I assume three), then you could ditch those 80 million row list objects and use a one-dimensional list of the 240 million string values instead. You'd just have to access it with markers[3*i+j] instead of markers[i][j]. And it could save a few GB RAM.
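
A rough sketch of that flat layout, assuming three values per row as above (the FIELDS constant is just for illustration, Python 3):

import sys

FIELDS = 3   # assumed number of values per row

markers = []  # flat list: value j of row i lives at markers[FIELDS*i + j]

def parseZeroIndex():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            # extend with the interned fields instead of appending a row list
            markers.extend(map(sys.intern, line.rstrip().split('\t')))

parseZeroIndex()

Accessing the second value of row i then becomes markers[FIELDS*i + 1], which corresponds to markers[i][1] in the original layout.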


7 Comments

This improved my solution significantly, thank you very much :) Still looking for ways to improve it though, 2 GB file uses around 8 GB of RAM at the moment (was approx 20 GB).
@hallizh Check out the experiment I added at the end and notice that my list takes a lot of memory, too. From your example data I estimate that you have 80 million rows, and sys.getsizeof(['a', 'b', 'c']) shows that a list with three string references takes 48 bytes. And 80 million times 48 bytes is about 3.6 GB.
@hallizh Oh and... please tell me what you get for sys.getsizeof(['a', 'b', 'c']). Given that you're able to use 8 GB, I suspect you might be running 64-bit Python?
@hallizh If your rows don't need to be lists, you could use tuples instead: markers.append(tuple(...)). Tuples are more light-weight, a three-tuple takes me 40 bytes instead of the list's 48 bytes.
@hallizh Another idea: If all your rows have the same number of values (I assume three), then you could ditch those ~80 million row list objects and use a one-dimensional list of ~240 million instead. You'd just have to access it with markers[3*i+j] instead of markers[i][j]. And it could save a few GB RAM.
