
Looping over np.load of npz files causes a memory overflow (depending on the length of the file list).

None of the following seems to help:

  1. Deleting the variable which stores the data in the file.

  2. Using mmap.

  3. Calling gc.collect() (garbage collection).

The following code should reproduce the phenomenon:

import numpy as np

# generate a file for the demo
X = np.random.randn(1000,1000)
np.savez('tmp.npz',X=X)


# here comes the overflow:
for i in xrange(1000000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error

In my real application the loop is over a list of files, and the overflow exceeds 24GB of RAM! Please note that this was tried on Ubuntu 11.10, with both numpy 1.5.1 and 1.6.0.

I have filed a report in numpy ticket 2048, but this may be of wider interest, so I am posting it here as well (moreover, I am not sure that this is a bug; it may be the result of my bad programming).

SOLUTION (by HYRY):

The command

del data.f

should precede the command

data.close()

For more information, and the method used to find the solution, please read HYRY's kind answer below.
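Putting the question's repro and HYRY's fix together, the corrected loop looks like this (a sketch in Python 3 syntax; on current numpy versions the cycle bug is fixed, so the explicit del is only needed on the affected versions and is harmless otherwise):

```python
import numpy as np

# generate a small file for the demo
X = np.random.randn(100, 100)
np.savez('tmp.npz', X=X)

# the loop no longer accumulates uncollectable NpzFile/BagObj pairs
for i in range(100):
    data = np.load('tmp.npz')
    del data.f       # break the NpzFile <-> BagObj reference cycle (HYRY's fix)
    data.close()     # avoid the "too many files are open" error
```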

  • Do you have an actual question, or is this just a pseudo blog post? Commented Feb 11, 2012 at 22:00
  • @talonmies I am not sure what is meant by a pseudo blog post. I believe I specified quite clearly the question. As to its real importance for me? In case it is not solved it means I have to find a less elegant solution (like quitting the session and running the job on chunks of files). I have one directory of 3562 files. That was enough to overflow 24GB (the total of RAM I have). Another contains 4735 files. Neither can be processed using the load function as I used it in my original post. Commented Feb 12, 2012 at 0:57

2 Answers


I think this is a bug, and I may have found the solution: call "del data.f" before closing.

for i in xrange(10000000):
    data = np.load('tmp.npz')
    del data.f
    data.close()  # avoid the "too many files are open" error

To find this kind of memory leak, you can use the following code:

import numpy as np
import gc
# here comes the overflow:
for i in xrange(10000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error

d = dict()
for o in gc.get_objects():
    name = type(o).__name__
    if name not in d:
        d[name] = 1
    else:
        d[name] += 1

items = d.items()
items.sort(key=lambda x:x[1])
for key, value in items:
    print key, value

After the test program finishes, I create a dict and count the objects returned by gc.get_objects(). Here is the output:

...
wrapper_descriptor 1382
function 2330
tuple 9117
BagObj 10000
NpzFile 10000
list 20288
dict 21001
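On Python 3, the same per-type counting diagnostic can be written more compactly with collections.Counter (a sketch of the technique, not HYRY's original code):

```python
import gc
from collections import Counter

# count live objects tracked by the garbage collector, grouped by type name
counts = Counter(type(o).__name__ for o in gc.get_objects())

# print the most common types; a leaking type shows up with a huge count
for name, n in counts.most_common(10):
    print(name, n)
```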

From the result we know that something is wrong with BagObj and NpzFile. Looking at the numpy source:

class NpzFile(object):
    def __init__(self, fid, own_fid=False):
        ...
        self.zip = _zip
        self.f = BagObj(self)
        if own_fid:
            self.fid = fid
        else:
            self.fid = None

    def close(self):
        """
        Close the file.

        """
        if self.zip is not None:
            self.zip.close()
            self.zip = None
        if self.fid is not None:
            self.fid.close()
            self.fid = None

    def __del__(self):
        self.close()

class BagObj(object):
    def __init__(self, obj):
        self._obj = obj
    def __getattribute__(self, key):
        try:
            return object.__getattribute__(self, '_obj')[key]
        except KeyError:
            raise AttributeError, key

NpzFile has __del__(), NpzFile.f is a BagObj, and BagObj._obj is the NpzFile: this is a reference cycle, which makes both NpzFile and BagObj uncollectable. There is some explanation in the Python documentation: http://docs.python.org/library/gc.html#gc.garbage

So, to break the reference cycle, you need to call "del data.f".
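The effect of breaking such a cycle can be demonstrated in isolation (a sketch using a hypothetical Node class standing in for NpzFile/BagObj; the immediate-free behavior relies on CPython's reference counting). Note that since Python 3.4 (PEP 442), cycles whose objects define __del__ are also collectable by gc, which is why the leak does not reproduce on modern Python:

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.partner = None

# Build a cycle, analogous to NpzFile.f -> BagObj and BagObj._obj -> NpzFile
a, b = Node(), Node()
a.partner, b.partner = b, a
alive = weakref.ref(a)

del a, b                      # the cycle keeps both objects alive ...
assert alive() is not None    # ... reference counting alone cannot free them
gc.collect()                  # only a cyclic-gc pass reclaims them
assert alive() is None

# Breaking the cycle first (the role of `del data.f`) lets refcounting work
c, d = Node(), Node()
c.partner, d.partner = d, c
alive = weakref.ref(c)
del d.partner                 # break the cycle, like `del data.f`
del c, d
assert alive() is None        # freed immediately, no gc pass needed
```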


3 Comments

I have not checked the method you proposed to find the source of the problem (I will try, though), but the solution (del data.f) seems to solve it. Thank you so much, both for the solution (this is a relief) and the quick reply!
OK, I read through the explanation of how to find the solution. I must admit it is somewhat advanced for me (looking through the gc and reading through the code). Nevertheless, I will try this first the next time I have a similar problem. So again, thanks a lot; it turned out more helpful than I expected! (Got to learn a debugging method along the way.) BTW, any idea why the NpzFile class would be coded in a cyclically referring manner?
This bug is fixed now. github.com/numpy/numpy/commit/…

What I found as the solution (Python 3.8 and numpy 1.18.5):

import gc  # garbage collector interface
import numpy as np

for i in range(1000):
    data = np.load('tmp.npz')  # the demo file from the question

    # process data

    data.close()  # avoid the "too many files are open" error
    del data
    gc.collect()

