
EDIT 3: TL;DR My issue was due to my matrix not being sparse enough, and to calculating the size of a sparse array incorrectly.

I was hoping someone could explain why this is happening. I am using Colab with 51 GB of memory, and I need to load float32 data from an H5 file. I am able to load a test H5 file as a numpy array, with RAM at ~45 GB: I load it in batches (21 total) and stack them. But when I load the data into numpy, convert it to sparse, and hstack the batches, memory explodes and I get an OOM after batch 12 or so.
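
For context, a minimal sketch of the batched-load pattern described above, assuming h5py; the helper and the dataset name 'data' are hypothetical, and my real code stacks with hstack, but the idea is the same:

import h5py
from scipy import sparse

def load_sparse_batches(path, n_batches=21):
    batches = []
    with h5py.File(path, 'r') as f:
        dset = f['data']  # hypothetical dataset name
        step = -(-dset.shape[0] // n_batches)  # ceiling division into 21 slices
        for start in range(0, dset.shape[0], step):
            chunk = dset[start:start + step]  # one dense float32 slice in RAM
            batches.append(sparse.csr_matrix(chunk))
            del chunk  # drop the dense slice before the next read
    return sparse.vstack(batches, format='csr')  # single stack at the end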

This code simulates the problem; you can change the data size to test it on your own machine. I get seemingly unexplainable memory increases, even though the sizes of my variables in memory look small. What is happening? What am I doing wrong?

import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')  # ~60% nonzero 0/1 matrix
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
for k in range(8):
  if all_x is None:
    all_x = x2
  else:
    # each iteration copies everything accumulated so far into a new matrix
    all_x = sparse.hstack([all_x, x2])
  print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
  print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
  gc.collect()
  print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
  print('_____________________')
GB on Memory SPARSE  0.481035332
GB on Memory NUMPY  0.797949952
sparse to dense mat ratio 0.6028389760464576
_____________________
GB on Memory ALL SPARSE  0.481035332
GB USED BEFORE GC 4.62065664
GB USED AFTER GC 4.6206976
_____________________
GB on Memory ALL SPARSE  0.962070664
GB USED BEFORE GC 8.473133056
GB USED AFTER GC 8.473133056
_____________________
GB on Memory ALL SPARSE  1.443105996
GB USED BEFORE GC 12.325183488
GB USED AFTER GC 12.325183488
_____________________
GB on Memory ALL SPARSE  1.924141328
GB USED BEFORE GC 17.140740096
GB USED AFTER GC 17.140740096
_____________________
GB on Memory ALL SPARSE  2.40517666
GB USED BEFORE GC 20.512710656
GB USED AFTER GC 20.512710656
_____________________
GB on Memory ALL SPARSE  2.886211992
GB USED BEFORE GC 22.920142848
GB USED AFTER GC 22.920142848
_____________________
GB on Memory ALL SPARSE  3.367247324
GB USED BEFORE GC 29.660889088
GB USED AFTER GC 29.660889088
_____________________
GB on Memory ALL SPARSE  3.848282656
GB USED BEFORE GC 33.99727104
GB USED AFTER GC 33.99727104
_____________________

EDIT: I stacked a list of dense arrays with np.hstack and it works fine:

import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')

all_x = np.hstack([x]*21)

print('GB on Memory ALL DENSE ', all_x.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
print('_____________________')

output

GB on Memory SPARSE  0.480956104
GB on Memory NUMPY  0.797949952
sparse to dense mat ratio 0.6027396866113227
_____________________
GB on Memory ALL DENSE  16.756948992
GB USED BEFORE GC 38.169387008
GB USED AFTER GC 38.169411584
_____________________

But when I do the same with a sparse matrix, I get an OOM, even though, going by the byte counts, the sparse matrix should be smaller:

import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')

all_x = sparse.hstack([x2]*21)

print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
print('_____________________')

But when I run the above, it returns an OOM error.
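
A plausible accounting, assuming 32-bit indices: sparse.hstack builds its result in COO format, which stores 4 bytes of data plus a 4-byte row index and a 4-byte column index per element, i.e. 12 bytes per stored element. With roughly 1.2 x 10^8 nonzeros per batch (the data.nbytes above divided by 4), the 21-way result alone needs about 21 * 1.2e8 * 12 bytes ≈ 30 GB, before counting x, x2, and any intermediates.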

EDIT 2: It seems I was calculating the true size of the sparse matrix incorrectly. It can be calculated with:

def bytes_in_sparse(a):
  return a.data.nbytes + a.indptr.nbytes + a.indices.nbytes  # data plus both CSR index arrays

The true comparison between the dense and sparse arrays is:

GB on Memory SPARSE  0.962395268
GB on Memory NUMPY  0.797949952
sparse to dense mat ratio 1.2060847495357703
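
That ratio follows from the density: with float32 data and int32 column indices, CSR costs about 8 bytes per stored element, versus a flat 4 bytes per element dense, so at ~60.3% density the sparse form needs roughly 0.603 * 8 / 4 ≈ 1.21 times the dense memory (indptr adds a little more).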

Once I use sparse.hstack, the two variables become different types of sparse matrices.

all_x, x2

outputs

(<97406x4096 sparse matrix of type '<class 'numpy.float32'>'
    with 240476696 stored elements in COOrdinate format>,
 <97406x2048 sparse matrix of type '<class 'numpy.float32'>'
    with 120238348 stored elements in Compressed Sparse Row format>)
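
Since the stacked result can come back as COO, a size helper has to read different attribute arrays per format; a format-aware sketch (the helper name is mine):

def sparse_nbytes(a):
    # CSR/CSC store data plus indptr/indices; COO stores data plus row/col
    if a.format in ('csr', 'csc'):
        return a.data.nbytes + a.indptr.nbytes + a.indices.nbytes
    if a.format == 'coo':
        return a.data.nbytes + a.row.nbytes + a.col.nbytes
    raise TypeError('unhandled sparse format: ' + a.format)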
  • As shown many times with np.append (and other concatenation functions), you should not do this kind of stacking repeatedly in a loop: all_x = sparse.hstack([all_x, x2]) makes a copy each time. sparse.hstack joins the coo attributes of your matrices together to make a new coo matrix. Collect all your matrices in a list, and do just one hstack at the end (a sketch follows these comments). Commented Mar 6, 2022 at 23:43
  • Thank you so much for your response. To the best of my knowledge, I tested what you said in the new edit provided above; can you please look at that edit and, if I am misunderstanding, explain? Also, wouldn't the fact that it temporarily copies the matrix lead only to memory spikes, not sustained memory usage? Thank you. Commented Mar 6, 2022 at 23:58
  • Don't use np.hstack to "join" sparse matrices. Commented Mar 7, 2022 at 0:24
  • I didn't. I used all_x = sparse.hstack([x2]*21) for the sparse matrices, but np.hstack on the dense ones; x2 is sparse, x is dense. Commented Mar 7, 2022 at 0:27
  • I guess my matrices are not sparse enough to benefit from using a sparse format, and I just calculated the size wrong. Thanks again for your help! Commented Mar 7, 2022 at 1:15
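
A sketch of the pattern the first comment recommends: collect the batches in a list and stack once. Here load_batch is a hypothetical stand-in for the per-batch H5 read; only the stacking pattern is the point.

from scipy import sparse

batches = []
for k in range(21):
    batches.append(load_batch(k))  # hypothetical loader returning one csr_matrix

# one hstack at the end: a single COO build instead of 21 ever-growing copies
all_x = sparse.hstack(batches, format='csr')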

1 Answer


With smaller dimensions, so I don't hang my computer:

In [50]: x = (1 * (np.random.rand(974, 204) > 0.39721115241072164)).astype("float32")
In [51]: x.nbytes
Out[51]: 794784

The CSR version and its approximate memory use:

In [52]: M = sparse.csr_matrix(x)
In [53]: M.data.nbytes + M.indices.nbytes + M.indptr.nbytes
Out[53]: 960308

sparse.hstack actually uses the COO format:

In [54]: Mo = M.tocoo()
In [55]: Mo.data.nbytes + Mo.row.nbytes + Mo.col.nbytes
Out[55]: 1434612
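
Depending on the SciPy version, the stacked result may come back as COO or CSR; sparse.hstack takes a format argument if you want to pin the output format (on the general path the COO build still happens internally):

MM = sparse.hstack([M] * 10, format='csr')  # request CSR output directly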

Combining 10 copies - nbytes increases by 10x:

In [56]: xx = np.hstack([x]*10)
In [57]: xx.shape
Out[57]: (974, 2040)

Same with sparse:

In [58]: MM = sparse.hstack([M] * 10)
In [59]: MM.shape
Out[59]: (974, 2040)
In [60]: xx.nbytes
Out[60]: 7947840
In [61]: MM
Out[61]: 
<974x2040 sparse matrix of type '<class 'numpy.float32'>'
    with 1195510 stored elements in Compressed Sparse Row format>
In [62]: M
Out[62]: 
<974x204 sparse matrix of type '<class 'numpy.float32'>'
    with 119551 stored elements in Compressed Sparse Row format>
In [63]: MM.data.nbytes + MM.indices.nbytes + MM.indptr.nbytes
Out[63]: 9567980

A sparse density of

In [65]: M.nnz / np.prod(M.shape)
Out[65]: 0.6016779401699078

does not save memory. 0.1 or smaller is a good working density if you want to save both memory and computation time (especially for matrix multiplication).

In [66]: ([email protected]).shape
Out[66]: (974, 974)
In [67]: timeit([email protected]).shape
10.1 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [68]: ([email protected]).shape
Out[68]: (974, 974)
In [69]: timeit([email protected]).shape
220 ms ± 91.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
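
On the memory side, a back-of-envelope breakeven check, assuming float32 data and int32 indices (a rough sketch, not exact SciPy accounting):

n, m = 974, 204
for density in (0.05, 0.1, 0.3, 0.5, 0.6):
    dense_bytes = n * m * 4                        # float32 elements
    csr_bytes = density * n * m * 8 + (n + 1) * 4  # 8 bytes per nnz, plus indptr
    print(f'density {density:.2f}: sparse/dense = {csr_bytes / dense_bytes:.2f}')

At roughly 0.5 density the CSR form already matches the dense array, which is why a working density of 0.1 or less is where sparse pays off.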

1 Comment

Great, thank you! I was wondering how to calculate the size of a COO sparse array: Mo.data.nbytes + Mo.row.nbytes + Mo.col.nbytes. I accepted your answer because you made it clear what was happening, thanks!
