0

I have a dataframe in pandas that looks like this:

>>> df[['BranchNumber', 'ModelArticleNumber', 'ActualSellingPrice']].info()
<class 'pandas.core.frame.DataFrame'>
Index: 447970 entries, 0 to 500734
Data columns (total 3 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   BranchNumber        447970 non-null  int64  
 1   ModelArticleNumber  447970 non-null  object 
 2   ActualSellingPrice  447970 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 13.7+ MB

Given that there are 463 unique branch numbers and 5,235 model article numbers, I'd like to create a 463x5,235 sparse matrix with indices. The following kludge is how far I've gotten:

import numpy as np
import scipy

def index(df):
    ix = { v: n for n, v in enumerate(df.unique()) }
    return np.array(list(map(lambda value: ix[value], df)))

def csr(df, row_tag, col_tag, value_tag):
    m = scipy.sparse.csr_matrix(
        (
            df[value_tag],
            (
                index(df[row_tag]),
                index(df[col_tag])
            )
        )
    )
        
    return m

I'd like to be able to do things like m.T.dot(m) efficiently, and retain the ability to see which key is associated with the rows and columns. I've looked through the documentation on pandas indices and arrays, and through the scipy docs on sparse matrices, and I'm just not there. Can somebody point me in the right direction?

1 Answer 1

3

I'd like to be able to do things like m.T.dot(m) efficiently, ...

This design looks like a reasonable way to do that.

... and retain the ability to see which key is associated with the rows and columns.

That's more complicated. SciPy sparse arrays don't natively support labeled data. You could keep the row/column mapping around in a separate variable, though. You can use pd.factorize() to create the row/column values and the row/column index simultaneously.

e.g.

def csr(df, row_tag, col_tag, value_tag):
    rows, row_labels = pd.factorize(df[row_tag])
    cols, col_labels = pd.factorize(df[col_tag])
    m = scipy.sparse.csr_matrix(
        (
            df[value_tag],
            (
                rows,
                cols
            )
        ),
        shape=(len(row_labels), len(col_labels))
    )

    return m, row_labels, col_labels

Then if you want to know the label for what is in m[10, 20], you can look at row_labels[10] and col_labels[20].

Alternatively, you could use a sparse DataFrame. Pandas has a dtype, SparseDtype, which stores the data for a column in a sparse format. Using the previous csr() method, you could do something like:

m, row_labels, col_labels = csr(df, 'BranchNumber', 'ModelArticleNumber', 'ActualSellingPrice')
df_sparse = pd.DataFrame.sparse.from_spmatrix(m, index=row_labels, columns=col_labels)

That gives you a labelled sparse DataFrame.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.