I have a sparse 2D matrix saved on disk (.npz extension) that I created in a preprocessing step with scipy.sparse.csr_matrix. It is a long sequence of 1-channel images in piano-roll format (a numerical representation of MIDI). I cannot convert the whole matrix to a dense representation because it will not fit in memory.
How do I create mini-batches with predefined sizes from the sparse matrix?
I've tried converting the CSR representation to COO and creating batches of data from it:
import scipy.sparse

sparse_matrix = scipy.sparse.load_npz(file_name)
coo_matrix = sparse_matrix.tocoo()
for batch_index in range(num_batches):
    start_index = batch_index * num_samples
    end_index = (batch_index + 1) * num_samples
    # Slice the stored (nonzero) values and their coordinates
    batch_data = coo_matrix.data[start_index:end_index]
    batch_row = coo_matrix.row[start_index:end_index]
    batch_col = coo_matrix.col[start_index:end_index]
    batch_sparse_matrix = scipy.sparse.coo_matrix(
        (batch_data, (batch_row, batch_col)),
        shape=(batch_size, image_width * image_height)
    )
But I get errors such as "row index exceeds matrix dimensions", which means I have too much data for the shape I defined: the row and column indices fall outside the shape boundaries.
I've also tried something like this to select the right amount of data, but it's very slow:
non_zero_indices = np.where((coo_matrix.row >= start_index) & (coo_matrix.row < end_index))[0]
start_index = non_zero_indices[0]
end_index = non_zero_indices[-1] + 1
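For reference, the np.where scan above touches every stored value for every batch, which is probably why it is slow. A CSR matrix already records where each row starts in its indptr array, so the same range can be looked up directly. The following is only a sketch under that assumption: the helper name coo_batch_from_csr and the toy matrix demo are made up for illustration and are not part of the original code. It also shifts the row indices so they are relative to the batch, which is what keeps them inside the batch shape.

```python
import numpy as np
import scipy.sparse as sp


def coo_batch_from_csr(csr, start_row, end_row):
    """Build a COO matrix holding rows [start_row, end_row) of a CSR matrix."""
    # indptr[r] is the offset of row r's first stored value in data/indices,
    # so this range covers every stored value of the requested rows.
    lo, hi = csr.indptr[start_row], csr.indptr[end_row]
    data = csr.data[lo:hi]
    cols = csr.indices[lo:hi]

    # Number of stored values in each row of the batch.
    counts = np.diff(csr.indptr[start_row:end_row + 1])
    # Row indices relative to the batch (0 .. end_row - start_row - 1).
    rows = np.repeat(np.arange(end_row - start_row), counts)

    return sp.coo_matrix(
        (data, (rows, cols)),
        shape=(end_row - start_row, csr.shape[1]),
    )


# Toy stand-in for the matrix loaded from the .npz file (hypothetical data).
demo = sp.random(10, 6, density=0.3, format="csr", random_state=0)
batch = coo_batch_from_csr(demo, 4, 8)
```

Here the slice bounds are row numbers rather than positions in the data array, so each batch always has exactly end_row - start_row rows regardless of how many nonzeros those rows contain.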
indptr, indices and data attributes of a csr? indptr can be used to index rows. But you can also use indexing: M[10:100] returns a 90 row slice (copy) of M. Row indexing is relatively efficient. docs.scipy.org/doc/scipy/reference/generated/…
rows_to_extract = np.arange(start_index, end_index); batch_sparse_matrix = self.sparse_matrix[rows_to_extract, :]
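The row-slicing suggestion in the comments could be wrapped in a small generator, so that only one batch is ever dense at a time. This is a minimal sketch, not a definitive implementation: iter_dense_batches and the toy matrix demo are hypothetical names, and in practice the matrix would come from scipy.sparse.load_npz as in the question.

```python
import numpy as np
import scipy.sparse as sp


def iter_dense_batches(csr, batch_size):
    """Yield consecutive row blocks of a CSR matrix as dense NumPy arrays."""
    n_rows = csr.shape[0]
    for start in range(0, n_rows, batch_size):
        # The last batch may be shorter than batch_size.
        end = min(start + batch_size, n_rows)
        # Slicing rows of a CSR matrix returns a new, smaller CSR matrix;
        # only this block is converted to a dense array.
        yield csr[start:end].toarray()


# Toy stand-in for the matrix loaded from disk (hypothetical data).
demo = sp.random(10, 6, density=0.3, format="csr", random_state=0)
batches = list(iter_dense_batches(demo, batch_size=4))
batch_shapes = [b.shape for b in batches]
```

Collecting the batches into a list is only done here to inspect them; in a training loop you would iterate over the generator directly so that earlier batches can be freed. A plain slice such as csr[start:end] is used instead of an index array like np.arange(start, end), since the rows are contiguous.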