Sparse matrix in pandas/scipy with row and column indices

Question

I have a dataframe in pandas that looks like this:

>>> df[['BranchNumber', 'ModelArticleNumber', 'ActualSellingPrice']].info()
<class 'pandas.core.frame.DataFrame'>
Index: 447970 entries, 0 to 500734
Data columns (total 3 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   BranchNumber        447970 non-null  int64  
 1   ModelArticleNumber  447970 non-null  object 
 2   ActualSellingPrice  447970 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 13.7+ MB

Given that there are 463 unique branch numbers and 5,235 model article numbers, I'd like to create a 463x5,235 sparse matrix with indices. The following kludge is how far I've gotten:

import numpy as np
import scipy

def index(df):
    ix = { v: n for n, v in enumerate(df.unique()) }
    return np.array(list(map(lambda value: ix[value], df)))

def csr(df, row_tag, col_tag, value_tag):
    m = scipy.sparse.csr_matrix(
        (
            df[value_tag],
            (
                index(df[row_tag]),
                index(df[col_tag])
            )
        )
    )
        
    return m

I'd like to be able to do things like m.T.dot(m) efficiently, and retain the ability to see which key is associated with the rows and columns. I've looked through the documentation on pandas indices and arrays, and through the scipy docs on sparse matrices, and I'm just not there. Can somebody point me in the right direction?

Nick ODell · Accepted Answer · 2024-12-12 22:08:52Z

I'd like to be able to do things like m.T.dot(m) efficiently, ...

This design looks like a reasonable way to do that.

... and retain the ability to see which key is associated with the rows and columns.

That's more complicated. SciPy sparse arrays don't natively support labeled data. You could keep the row/column mapping around in a separate variable, though. You can use pd.factorize() to create the row/column values and the row/column index simultaneously.

e.g.

def csr(df, row_tag, col_tag, value_tag):
    rows, row_labels = pd.factorize(df[row_tag])
    cols, col_labels = pd.factorize(df[col_tag])
    m = scipy.sparse.csr_matrix(
        (
            df[value_tag],
            (
                rows,
                cols
            )
        ),
        shape=(len(row_labels), len(col_labels))
    )

    return m, row_labels, col_labels

Then if you want to know the label for what is in m[10, 20], you can look at row_labels[10] and col_labels[20].

Alternatively, you could use a sparse DataFrame. Pandas has a dtype, SparseDtype, which stores the data for a column in a sparse format. Using the previous csr() method, you could do something like:

m, row_labels, col_labels = csr(df, 'BranchNumber', 'ModelArticleNumber', 'ActualSellingPrice')
df_sparse = pd.DataFrame.sparse.from_spmatrix(m, index=row_labels, columns=col_labels)

That gives you a labelled sparse DataFrame.

Collectives™ on Stack Overflow

Sparse matrix in pandas/scipy with row and column indices

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related