Differences between dataframe spearman correlation using pandas and scipy

Question

I have a fairly big matrix (4780, 5460) and computed the Spearman correlation between rows using both "pandas.DataFrame.corr" and "scipy.stats.spearmanr". The two functions return very different correlation coefficients, and now I am not sure which one is "correct", or whether my dataset is better suited to a different implementation.

Some context: the vectors (rows) I want to test for correlation do not necessarily share all the same points; there are NaNs in some columns and not in others.

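# pandas: transpose so that the original rows become the correlated columns
df.T.corr(method='spearman')
# scipy: r has shape (n_rows, n_rows), so both axes are labelled with df.index
(r, p) = spearmanr(df.T)
df2 = pd.DataFrame(index=df.index, columns=df.index, data=r)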

In[47]: df['320840_93602.563']
Out[47]: 
320840_93602.563                    1.000000
3254_642.148.peg.3256               0.565812
13752_42938.1206                    0.877192
319002_93602.870                    0.225530
328_642.148.peg.330                 0.658269
                                      ...   
12566_42938.19                      0.818395
321125_93602.2882                   0.535577
319185_93602.1135                   0.678397
29724_39.3584                       0.770453
321030_93602.1962                   0.738722
Name: 320840_93602.563, dtype: float64

In[32]: df2['320840_93602.563']
Out[32]: 
320840_93602.563                    1.000000
3254_642.148.peg.3256               0.444675
13752_42938.1206                    0.286933
319002_93602.870                    0.225530
328_642.148.peg.330                 0.606619
                                      ...   
12566_42938.19                      0.212265
321125_93602.2882                   0.587409
319185_93602.1135                   0.696172
29724_39.3584                       0.097753
321030_93602.1962                   0.163417
Name: 320840_93602.563, dtype: float64

Also, watch out for the intrinsic data alignment. Scipy doesn't appear to do it in all cases, while pandas will. See here: stackoverflow.com/questions/74375662/…
– Blaze, Commented Nov 9, 2022 at 21:09

Warren Weckesser · Accepted Answer · 2017-08-07 17:12:24Z


scipy.stats.spearmanr is not designed to handle nan, and its behavior with nan values is undefined. [Update: scipy.stats.spearmanr now has the argument nan_policy.]

For data without nans, the functions appear to agree:

In [92]: np.random.seed(123)

In [93]: df = pd.DataFrame(np.random.randn(5, 5))

In [94]: df.T.corr(method='spearman')
Out[94]: 
     0    1    2    3    4
0  1.0 -0.8  0.8  0.7  0.1
1 -0.8  1.0 -0.7 -0.7 -0.1
2  0.8 -0.7  1.0  0.8 -0.1
3  0.7 -0.7  0.8  1.0  0.5
4  0.1 -0.1 -0.1  0.5  1.0

In [95]: rho, p = spearmanr(df.values.T)

In [96]: rho
Out[96]: 
array([[ 1. , -0.8,  0.8,  0.7,  0.1],
       [-0.8,  1. , -0.7, -0.7, -0.1],
       [ 0.8, -0.7,  1. ,  0.8, -0.1],
       [ 0.7, -0.7,  0.8,  1. ,  0.5],
       [ 0.1, -0.1, -0.1,  0.5,  1. ]])
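
When the data does contain nans, the two libraries can be compared on a single pair of vectors. The following is a minimal sketch, not taken from the original answer: the arrays `a` and `b` and the positions of the missing values are made up for illustration. It assumes a SciPy version that supports `nan_policy`, and it relies on pandas excluding missing values pair by pair when computing `corr`. Under those assumptions, pandas, `spearmanr(..., nan_policy='omit')`, and a manual drop of incomplete pairs should all give the same coefficient, while the default `nan_policy='propagate'` should return nan.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Illustrative data (hypothetical): two related vectors with
# missing values at different positions.
rng = np.random.default_rng(0)
a = rng.normal(size=20)
b = a + rng.normal(size=20)
a[[2, 7]] = np.nan
b[[7, 11, 15]] = np.nan

# pandas: NaNs are excluded pairwise before the ranks are correlated.
frame = pd.DataFrame({"a": a, "b": b})
rho_pandas = frame.corr(method="spearman").loc["a", "b"]

# scipy with the default nan_policy='propagate'.
rho_default, _ = spearmanr(a, b)

# scipy with nan_policy='omit': incomplete pairs are dropped.
rho_omit, _ = spearmanr(a, b, nan_policy="omit")
rho_omit = float(rho_omit)

# Manual reference: keep only positions where both values are present.
keep = ~(np.isnan(a) | np.isnan(b))
rho_manual, _ = spearmanr(a[keep], b[keep])

print("pandas          :", rho_pandas)
print("scipy, default  :", rho_default)
print("scipy, omit     :", rho_omit)
print("scipy, filtered :", rho_manual)
```

For a full matrix, the handling of nans by `spearmanr` with `nan_policy='omit'` may differ between SciPy versions, so checking a few pairs this way against the pandas result is a reasonable sanity test before trusting either matrix.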

edited Aug 7, 2017 at 17:12

answered Jul 15, 2015 at 17:09

