I have a dataframe in pandas that looks like this:
>>> df[['BranchNumber', 'ModelArticleNumber', 'ActualSellingPrice']].info()
<class 'pandas.core.frame.DataFrame'>
Index: 447970 entries, 0 to 500734
Data columns (total 3 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   BranchNumber        447970 non-null  int64
 1   ModelArticleNumber  447970 non-null  object
 2   ActualSellingPrice  447970 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 13.7+ MB
Given that there are 463 unique branch numbers and 5,235 unique model article numbers, I'd like to build a 463x5,235 sparse matrix while keeping track of which branch maps to which row and which article maps to which column. The following kludge is how far I've gotten:
import numpy as np
import scipy.sparse

def index(df):
    # Map each unique value to a dense integer position 0..k-1.
    ix = {v: n for n, v in enumerate(df.unique())}
    return np.array([ix[value] for value in df])

def csr(df, row_tag, col_tag, value_tag):
    m = scipy.sparse.csr_matrix(
        (
            df[value_tag],
            (
                index(df[row_tag]),
                index(df[col_tag])
            )
        )
    )
    return m
The problem is that the mapping dict built inside index() is thrown away, so I can't tell which branch or article a given row or column corresponds to. I'd like to be able to do things like m.T.dot(m) efficiently, and retain the ability to see which key is associated with each row and column. I've looked through the pandas documentation on indices and arrays, and through the scipy docs on sparse matrices, and I'm just not there. Can somebody point me in the right direction?
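One direction I've been experimenting with, in case it helps frame the question: pd.factorize returns both the integer codes and the unique labels, so the key-to-index mapping doesn't get lost. A minimal sketch on toy data (the column names match my dataframe, but the values here are made up):

```python
import pandas as pd
import scipy.sparse

# Toy stand-in for the real dataframe (illustrative values only).
df = pd.DataFrame({
    'BranchNumber': [1, 1, 2, 3],
    'ModelArticleNumber': ['a', 'b', 'a', 'c'],
    'ActualSellingPrice': [10.0, 20.0, 30.0, 40.0],
})

# factorize returns (codes, uniques): codes are dense 0..k-1 integers,
# uniques lets me map a row/column index back to its original key.
rows, row_keys = pd.factorize(df['BranchNumber'])
cols, col_keys = pd.factorize(df['ModelArticleNumber'])

m = scipy.sparse.csr_matrix(
    (df['ActualSellingPrice'], (rows, cols)),
    shape=(len(row_keys), len(col_keys)),
)

gram = m.T.dot(m)   # article x article; would be 5235x5235 on the real data
print(row_keys)     # the branch number behind each row index
print(m.toarray())
```

This seems to work for the toy case, but I'm not sure it's the idiomatic way, or whether a pandas Categorical / MultiIndex would serve better.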