
Looping over np.load of npz files causes a memory overflow (depending on the length of the file list).

None of the following seems to help:

  1. Deleting the variable which stores the data in the file.

  2. Using mmap.

  3. Calling gc.collect() (garbage collection).

The following code should reproduce the phenomenon:

import numpy as np

# generate a file for the demo
X = np.random.randn(1000,1000)
np.savez('tmp.npz',X=X)


# here comes the overflow:
for i in xrange(1000000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error

In my real application the loop is over a list of files, and the overflow exceeds 24GB of RAM! Please note that this was tried on Ubuntu 11.10, with both numpy 1.5.1 and 1.6.0.

I have filed a report in numpy ticket 2048, but this may be of wider interest, so I am posting it here as well (moreover, I am not sure that this is a bug; it may be the result of my bad programming).

SOLUTION (by HYRY):

The command

del data.f

should precede the command

data.close()

For more information, and the method used to find the solution, please read HYRY's kind answer below.
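Putting the question's repro and HYRY's fix together, the corrected loop looks like this (a sketch in Python 3 syntax; on current numpy versions the cycle bug is fixed, so the explicit del is only needed on the affected versions and is harmless otherwise):

```python
import numpy as np

# generate a small file for the demo
X = np.random.randn(100, 100)
np.savez('tmp.npz', X=X)

# the loop no longer accumulates uncollectable NpzFile/BagObj pairs
for i in range(100):
    data = np.load('tmp.npz')
    del data.f       # break the NpzFile <-> BagObj reference cycle (HYRY's fix)
    data.close()     # avoid the "too many files are open" error
```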

  • Do you have an actual question, or is this just a pseudo blog post? Commented Feb 11, 2012 at 22:00
  • @talonmies I am not sure what is meant by a pseudo blog post. I believe I specified quite clearly the question. As to its real importance for me? In case it is not solved it means I have to find a less elegant solution (like quitting the session and running the job on chunks of files). I have one directory of 3562 files. That was enough to overflow 24GB (the total of RAM I have). Another contains 4735 files. Neither can be processed using the load function as I used it in my original post. Commented Feb 12, 2012 at 0:57

2 Answers


I think this is a bug, and I may have found the solution: call "del data.f" before closing.

for i in xrange(10000000):
    data = np.load('tmp.npz')
    del data.f
    data.close()  # avoid the "too many files are open" error

To find this kind of memory leak, you can use the following code:

import numpy as np
import gc
# here comes the overflow:
for i in xrange(10000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error

d = dict()
for o in gc.get_objects():
    name = type(o).__name__
    if name not in d:
        d[name] = 1
    else:
        d[name] += 1

items = d.items()
items.sort(key=lambda x:x[1])
for key, value in items:
    print key, value

After the test program finishes, I create a dict and count the objects returned by gc.get_objects(). Here is the output:

...
wrapper_descriptor 1382
function 2330
tuple 9117
BagObj 10000
NpzFile 10000
list 20288
dict 21001
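On Python 3, the same per-type counting diagnostic can be written more compactly with collections.Counter (a sketch of the technique, not HYRY's original code):

```python
import gc
from collections import Counter

# count live objects tracked by the garbage collector, grouped by type name
counts = Counter(type(o).__name__ for o in gc.get_objects())

# print the most common types; a leaking type shows up with a huge count
for name, n in counts.most_common(10):
    print(name, n)
```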

From the result we know that something is wrong with BagObj and NpzFile. Looking at the numpy source:

class NpzFile(object):
    def __init__(self, fid, own_fid=False):
        ...
        self.zip = _zip
        self.f = BagObj(self)
        if own_fid:
            self.fid = fid
        else:
            self.fid = None

    def close(self):
        """
        Close the file.

        """
        if self.zip is not None:
            self.zip.close()
            self.zip = None
        if self.fid is not None:
            self.fid.close()
            self.fid = None

    def __del__(self):
        self.close()

class BagObj(object):
    def __init__(self, obj):
        self._obj = obj
    def __getattribute__(self, key):
        try:
            return object.__getattribute__(self, '_obj')[key]
        except KeyError:
            raise AttributeError, key

NpzFile has __del__(), NpzFile.f is a BagObj, and BagObj._obj is the NpzFile: this is a reference cycle, which makes both NpzFile and BagObj uncollectable. There is some explanation in the Python documentation: http://docs.python.org/library/gc.html#gc.garbage

So, to break the reference cycle, you need to call "del data.f".
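The effect of breaking such a cycle can be demonstrated in isolation (a sketch using a hypothetical Node class standing in for NpzFile/BagObj; the immediate-free behavior relies on CPython's reference counting). Note that since Python 3.4 (PEP 442), cycles whose objects define __del__ are also collectable by gc, which is why the leak does not reproduce on modern Python:

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.partner = None

# Build a cycle, analogous to NpzFile.f -> BagObj and BagObj._obj -> NpzFile
a, b = Node(), Node()
a.partner, b.partner = b, a
alive = weakref.ref(a)

del a, b                      # the cycle keeps both objects alive ...
assert alive() is not None    # ... reference counting alone cannot free them
gc.collect()                  # only a cyclic-gc pass reclaims them
assert alive() is None

# Breaking the cycle first (the role of `del data.f`) lets refcounting work
c, d = Node(), Node()
c.partner, d.partner = d, c
alive = weakref.ref(c)
del d.partner                 # break the cycle, like `del data.f`
del c, d
assert alive() is None        # freed immediately, no gc pass needed
```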


3 Comments

I have not checked the method you proposed to find the source of the problem (I will try, though), but the solution (del data.f) seems to solve it. Thank you so much, both for the solution (this is a relief) and the quick reply!
OK, I read through the explanation of how to find the solution. I must admit it is somewhat advanced for me (looking through the gc and reading through the code). Nevertheless, I will try this first the next time I have a similar problem. So again, thanks a lot; it turned out more helpful than I expected! (Got to learn a debugging method along the way.) BTW, any idea why the NpzFile class would be coded in a cyclically referring manner?
This bug is fixed now. github.com/numpy/numpy/commit/…

What I found as the solution (Python 3.8 and numpy 1.18.5):

import gc  # garbage collector interface
import numpy as np

for i in range(1000):
    data = np.load('tmp.npz')  # the demo file from the question

    # process data

    data.close()  # avoid the "too many files are open" error
    del data
    gc.collect()

