I have a fairly big matrix (4780, 5460) and computed the Spearman correlation between rows using both "pandas.DataFrame.corr" and "scipy.stats.spearmanr". The two functions return very different correlation coefficients, and now I am not sure which one is "correct", or whether my dataset is better suited to a different implementation.

Some context: the vectors (rows) I want to test for correlation do not necessarily share all the same points; there are NaNs in some columns and not in others.

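import pandas as pd
from scipy.stats import spearmanr
df.T.corr(method='spearman')
(r, p) = spearmanr(df.T)
# r is square (one row and one column per row of df), so label both axes with df.index
df2 = pd.DataFrame(index=df.index, columns=df.index, data=r)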

In[47]: df['320840_93602.563']
Out[47]: 
320840_93602.563                    1.000000
3254_642.148.peg.3256               0.565812
13752_42938.1206                    0.877192
319002_93602.870                    0.225530
328_642.148.peg.330                 0.658269
                                      ...   
12566_42938.19                      0.818395
321125_93602.2882                   0.535577
319185_93602.1135                   0.678397
29724_39.3584                       0.770453
321030_93602.1962                   0.738722
Name: 320840_93602.563, dtype: float64

In[32]: df2['320840_93602.563']
Out[32]: 
320840_93602.563                    1.000000
3254_642.148.peg.3256               0.444675
13752_42938.1206                    0.286933
319002_93602.870                    0.225530
328_642.148.peg.330                 0.606619
                                      ...   
12566_42938.19                      0.212265
321125_93602.2882                   0.587409
319185_93602.1135                   0.696172
29724_39.3584                       0.097753
321030_93602.1962                   0.163417
Name: 320840_93602.563, dtype: float64
  • Also, watch out for the intrinsic data alignment. Scipy doesn't appear to do it in all cases, while pandas will. See here: stackoverflow.com/questions/74375662/… Commented Nov 9, 2022 at 21:09
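
The alignment point in this comment can be illustrated with a minimal sketch. The two Series below are hypothetical toy data, not taken from the question: they hold the same label-to-value mapping but are stored in opposite order. pandas matches values by index label before correlating, while SciPy only sees the values in their stored order.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical toy data: one Series, and the same Series stored in reverse order.
a = pd.Series([1.0, 2.0, 3.0, 4.0], index=["w", "x", "y", "z"])
b = a.iloc[::-1]  # same labels and values, reversed storage order

# pandas aligns the two Series on their index labels before correlating.
rho_pandas = a.corr(b, method="spearman")

# SciPy compares position 0 with position 0, position 1 with position 1, and so on.
rho_scipy, _ = spearmanr(a, b)

print(rho_pandas, rho_scipy)
```

If the two libraries disagree on data without NaN, checking that the inputs are in the same order is a reasonable first step.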

1 Answer

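scipy.stats.spearmanr is not designed to handle NaN, and its behavior with NaN values is undefined, whereas pandas.DataFrame.corr excludes NaN values pairwise; that difference would explain the mismatch on your data. [Update: scipy.stats.spearmanr now has the argument nan_policy.]

For data without NaNs, the functions appear to agree: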

In [92]: np.random.seed(123)

In [93]: df = pd.DataFrame(np.random.randn(5, 5))

In [94]: df.T.corr(method='spearman')
Out[94]: 
     0    1    2    3    4
0  1.0 -0.8  0.8  0.7  0.1
1 -0.8  1.0 -0.7 -0.7 -0.1
2  0.8 -0.7  1.0  0.8 -0.1
3  0.7 -0.7  0.8  1.0  0.5
4  0.1 -0.1 -0.1  0.5  1.0

In [95]: rho, p = spearmanr(df.values.T)

In [96]: rho
Out[96]: 
array([[ 1. , -0.8,  0.8,  0.7,  0.1],
       [-0.8,  1. , -0.7, -0.7, -0.1],
       [ 0.8, -0.7,  1. ,  0.8, -0.1],
       [ 0.7, -0.7,  0.8,  1. ,  0.5],
       [ 0.1, -0.1, -0.1,  0.5,  1. ]])
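
For data with NaNs, a minimal sketch of the comparison might look like the following. The vectors x and y are hypothetical toy data, and the manual calculation (drop every position where either value is missing, rank what is left, take the Pearson correlation of the ranks) is an assumed reference, not something taken from either library's source. It compares that reference with pandas and with spearmanr using nan_policy="omit".

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata, spearmanr

# Hypothetical toy vectors with NaN in different positions.
x = np.array([1.0, 4.0, np.nan, 3.0, 8.0, 6.0, 2.0])
y = np.array([2.0, np.nan, 5.0, 1.0, 9.0, 7.0, 3.0])

# Manual reference: keep positions where both values are present,
# rank the survivors, then take the Pearson correlation of the ranks.
both = ~np.isnan(x) & ~np.isnan(y)
rho_manual = np.corrcoef(rankdata(x[both]), rankdata(y[both]))[0, 1]

# pandas excludes incomplete pairs for each pair of columns.
rho_pandas = pd.DataFrame({"x": x, "y": y}).corr(method="spearman").loc["x", "y"]

# SciPy with nan_policy="omit" drops the incomplete pairs as well.
rho_omit, _ = spearmanr(x, y, nan_policy="omit")

# With the default nan_policy="propagate", the result is NaN.
rho_default, _ = spearmanr(x, y)

print(rho_manual, rho_pandas, rho_omit, rho_default)
```

On this two-vector case the three NaN-aware results should coincide. For a full matrix it is worth spot-checking a few pairs this way before relying on either output.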