EDIT 3 (TL;DR): My issue came down to two things: my matrix was not sparse enough to benefit from a sparse format, and I was calculating the size of the sparse array incorrectly.
I was hoping someone could explain why this is happening. I am using Colab with 51 GB of RAM, and I need to load float32 data from an H5 file. I am able to load a test H5 file as a numpy array, in 21 batches that I stack, with RAM usage around 45 GB. But when I load the data into numpy, convert it to sparse, and hstack the batches, memory explodes and I get an OOM after batch 12 or so.
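For context, the sparse loading path looks roughly like this (a sketch with hypothetical file and dataset names, using h5py):

import h5py
from scipy import sparse

with h5py.File('data.h5', 'r') as f:       # hypothetical file name
    dset = f['features']                    # hypothetical dataset name
    step = dset.shape[1] // 21 + 1
    all_x = None
    for start in range(0, dset.shape[1], step):
        batch = sparse.csr_matrix(dset[:, start:start + step].astype('float32'))
        # incrementally hstack each new batch onto the accumulated matrix
        all_x = batch if all_x is None else sparse.hstack([all_x, batch])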
The code below simulates it; you can change the data size to test it on your machine. I get seemingly unexplainable memory increases even though the reported sizes of my variables look small. What is happening? What am I doing wrong?
import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np

all_x = None
# random binary matrix, ~60% of entries nonzero, stored as float32
x = (1 * (np.random.rand(97406, 2048) > 0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes / 10**9)
print('GB on Memory NUMPY ', x.nbytes / 10**9)
print('sparse to dense mat ratio', x2.data.nbytes / x.nbytes)
print('_____________________')
for k in range(8):
    # grow the accumulated matrix by one block per iteration
    if all_x is None:
        all_x = x2
    else:
        all_x = sparse.hstack([all_x, x2])
    print('GB on Memory ALL SPARSE ', all_x.data.nbytes / 10**9)
    print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss / 10**9)
    gc.collect()
    print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss / 10**9)
    print('_____________________')

Output:
GB on Memory SPARSE 0.481035332
GB on Memory NUMPY 0.797949952
sparse to dense mat ratio 0.6028389760464576
_____________________
GB on Memory ALL SPARSE 0.481035332
GB USED BEFORE GC 4.62065664
GB USED AFTER GC 4.6206976
_____________________
GB on Memory ALL SPARSE 0.962070664
GB USED BEFORE GC 8.473133056
GB USED AFTER GC 8.473133056
_____________________
GB on Memory ALL SPARSE 1.443105996
GB USED BEFORE GC 12.325183488
GB USED AFTER GC 12.325183488
_____________________
GB on Memory ALL SPARSE 1.924141328
GB USED BEFORE GC 17.140740096
GB USED AFTER GC 17.140740096
_____________________
GB on Memory ALL SPARSE 2.40517666
GB USED BEFORE GC 20.512710656
GB USED AFTER GC 20.512710656
_____________________
GB on Memory ALL SPARSE 2.886211992
GB USED BEFORE GC 22.920142848
GB USED AFTER GC 22.920142848
_____________________
GB on Memory ALL SPARSE 3.367247324
GB USED BEFORE GC 29.660889088
GB USED AFTER GC 29.660889088
_____________________
GB on Memory ALL SPARSE 3.848282656
GB USED BEFORE GC 33.99727104
GB USED AFTER GC 33.99727104
_____________________
EDIT: I stacked a list of 21 copies with np.hstack and it works fine.
import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
all_x = np.hstack([x]*21)  # stacking 21 copies of the DENSE array x
print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)  # label says SPARSE, but all_x is dense here
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
print('_____________________')
Output:
GB on Memory SPARSE 0.480956104
GB on Memory NUMPY 0.797949952
sparse to dense mat ratio 0.6027396866113227
_____________________
GB on Memory ALL SPARSE 16.756948992
GB USED BEFORE GC 38.169387008
GB USED AFTER GC 38.169411584
_____________________
But when I do the same with the sparse matrix I get an OOM, even though, going by the byte counts, the sparse matrix should be smaller.
import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
all_x = sparse.hstack([x2]*21)  # sparse counterpart of the dense stack above
print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
print('_____________________')
Running the above returns an OOM error.
EDIT 2: It seems I was calculating the true size of the sparse matrix incorrectly. For a CSR matrix it can be calculated with:
def bytes_in_sparse(a):
    # a CSR matrix stores three arrays: data, indices, and indptr
    return a.data.nbytes + a.indptr.nbytes + a.indices.nbytes
The true comparison between the dense and sparse arrays is:
GB on Memory SPARSE 0.962395268
GB on Memory NUMPY 0.797949952
sparse to dense mat ratio 1.2060847495357703
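Those numbers come from repeating the size prints with the helper, along these lines. With float32 values, each stored nonzero in CSR costs 4 bytes for the value plus 4 bytes for its int32 column index, so at ~60% density the "sparse" form is actually larger than the dense one:

print('GB on Memory SPARSE ', bytes_in_sparse(x2) / 10**9)
print('GB on Memory NUMPY ', x.nbytes / 10**9)
print('sparse to dense mat ratio', bytes_in_sparse(x2) / x.nbytes)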
Once I use sparse.hstack, the two variables end up as different types of sparse matrix:
all_x, x2
outputs
(<97406x4096 sparse matrix of type '<class 'numpy.float32'>'
with 240476696 stored elements in COOrdinate format>,
<97406x2048 sparse matrix of type '<class 'numpy.float32'>'
with 120238348 stored elements in Compressed Sparse Row format>)
From the answer I got: as with np.append (and the other concatenate functions), you should not do this kind of stack repeatedly in a loop, as in all_x = sparse.hstack([all_x, x2]); it makes a copy each time. sparse.hstack joins the coo attributes of your matrices together to make a new coo matrix. Collect all your matrices in a list, and do just one hstack at the end. And don't use np.hstack to "join" sparse matrices.
My reply: I did use all_x = sparse.hstack([x2]*21) for the sparse case; I only used np.hstack on the dense one. x2 is sparse, x is dense.
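A minimal sketch of that advice applied to my batch loading, assuming a hypothetical load_batch() that returns one float32 numpy batch (the name is mine): collect every batch as CSR in a list, then stack once, asking sparse.hstack for CSR output instead of the default COO.

from scipy import sparse

batches = []
for k in range(21):
    xb = load_batch(k)                    # hypothetical: returns one float32 numpy batch
    batches.append(sparse.csr_matrix(xb))
# a single stack at the end; format='csr' returns CSR instead of the default COO
all_x = sparse.hstack(batches, format='csr')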