
I have a list with 2940 elements - each element is a (60, 2094) numpy array.

import sys
import numpy as np

print('DataX:')
print('len:')
print(len(dataX))
print('shape:')
for i in range(5):
    print(dataX[i].shape)
print('dtype:')
print(dataX[0].dtype)

print('size', sys.getsizeof(dataX)/1000000)

results in:

DataX:
len:
2940
shape:
(60, 2094)
(60, 2094)
(60, 2094)
(60, 2094)
(60, 2094)
dtype:
float64
size 0.023728

However, if I try to turn this into a numpy array (which should result in a shape of (2940, 60, 2094)), the size of the array is much, much larger.

# convert list to array
X = np.array(dataX)
print('X:')
print('shape', X.shape)
print('size', sys.getsizeof(X)/1000000)

Output:

X:
shape (2940, 60, 2094)
size 2955.052928

Why is this the case?

If I try it with a bigger dataset, I end up with a MemoryError.

Comments:
  • Have you read the documentation about sys.getsizeof? docs.python.org/3/library/sys.html#sys.getsizeof
  • Hi. Are you talking about this part: "but this does not have to hold true for third-party extensions as it is implementation specific"? I don't think it holds true in my case, as I'm getting a MemoryError for X when I increase the data size, but I don't seem to get a MemoryError for the list of arrays. I also checked the memory usage with Task Manager and it seems to be accurate.
  • sys.getsizeof only gives you the size of the list, not including the objects in the list. That is the source of the discrepancy.

1 Answer


From the sys.getsizeof docs:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

sys.getsizeof returns the memory consumption of the list object itself, not including the objects contained by the list. A single one of your arrays:

In [3]: arr = np.zeros(dtype=np.float64, shape=(60, 2094))

In [4]: arr.size
Out[4]: 125640

In [5]: arr.nbytes
Out[5]: 1005120 

The Python object wrapping that raw data buffer adds only about 100 bytes of overhead; there is always some cost to being an object:

In [6]: sys.getsizeof(arr)
Out[6]: 1005232
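
That difference is exactly the per-object overhead. A quick way to make it explicit, continuing the session above (the 112 bytes seen here can vary with NumPy version and platform):

# ndarray header reported by getsizeof on top of the raw buffer:
# 1005232 - 1005120 = 112 bytes in this session.
overhead = sys.getsizeof(arr) - arr.nbytes
print(overhead)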

The actual memory consumption, then, is about:

In [7]: arr.nbytes*1e-9
Out[7]: 0.00100512 # about one megabyte (the value is in gigabytes)

And if we had 2940 of them, just those objects would be:

In [8]: arr.nbytes*2940*1e-9
Out[8]: 2.9550528000000003 # almost 3 gigabytes

If I actually put these all in a list:

In [13]: alist = []

In [14]: alist.append(arr)

In [15]: for _ in range(2940 - 1):
    ...:     alist.append(arr.copy())
    ...:

The list object itself is essentially backed by an array of py_object pointers. On my machine (64-bit) a pointer is one machine word, i.e. 64 bits or 8 bytes. So:

In [19]: sys.getsizeof(alist)
Out[19]: 23728

In [20]: 8*len(alist) # 8 bytes per pointer
Out[20]: 23520

So sys.getsizeof is only accounting for the list's array of pointers, plus the list object's own header and a bit of over-allocation. That doesn't come close to accounting for the roughly 3 gigabytes consumed by the array objects those pointers refer to.
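
If you want to tally the real footprint yourself, you have to walk the list's elements. A minimal sketch, continuing the session (it assumes the arrays do not share buffers, otherwise shared data would be counted more than once; the ~100-byte ndarray headers are ignored as negligible):

# The list's own pointer array plus every element's raw data buffer.
list_overhead = sys.getsizeof(alist)
data_bytes = sum(a.nbytes for a in alist)
print((list_overhead + data_bytes) * 1e-9)  # ~2.955 GB, matching the figure above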

Lo and behold:

In [21]: arr = np.array(alist)

In [22]: arr.shape
Out[22]: (2940, 60, 2094)

In [23]: arr.size
Out[23]: 369381600

In [24]: arr.nbytes
Out[24]: 2955052800

In [25]: arr.nbytes* 1e-9
Out[25]: 2.9550528000000003
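
As an aside, this is also why sys.getsizeof reported a sensible ~2955 MB for X in the question: an ndarray that owns its buffer counts that buffer in its own size (as In [6] above already showed), so nothing is hidden behind pointers the way it is with a list. A small sketch with a lighter stand-in shape, since the full (2940, 60, 2094) array is what triggers the MemoryError:

import sys
import numpy as np

# Stand-in for X = np.array(dataX), shrunk so it is cheap to run.
X = np.zeros((10, 60, 2094), dtype=np.float64)
print(X.flags.owndata)              # True: X owns its data buffer
print(X.nbytes)                     # 10 * 60 * 2094 * 8 = 10051200 bytes
print(sys.getsizeof(X) - X.nbytes)  # only ~100 bytes of object overhead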

Comments:
  • Thank you for this. I just want to run a few tests tomorrow to make sure, as I was seeing some behavior that conflicted with what you laid out.
  • @Moondra The other thing to understand is that when you do arr = np.array(alist), it will temporarily require roughly double the memory, since the data is copied, not shared (see the sketch below).
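
To see that the conversion copies rather than shares, np.shares_memory can be checked against one of the list elements. A toy-sized sketch of the same pattern:

import numpy as np

alist = [np.zeros((60, 2094)) for _ in range(3)]
stacked = np.array(alist)
# The stacked array gets its own contiguous buffer; none of it aliases
# the original arrays, so both copies coexist in memory until the list
# (or the new array) is released.
print(stacked.flags.owndata)                # True
print(np.shares_memory(stacked, alist[0]))  # False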
