I am correlating two data frames using the code below. Basically, I select a set of columns from one data frame (a) and a single column from the other data frame (b). It works perfectly, except that I need Spearman's correlation instead of the default Pearson. I would appreciate any input or ideas. Thank you.

 a.ix[:,800000:800010].corrwith(b.ix[:,0])

1 Answer

Consider using pandas.Series.corr inside a DataFrame apply: each column of a is passed to a function (here an anonymous lambda) and correlated with the column from b:

Random data (seeded to reproduce)

import pandas as pd
import numpy as np

np.random.seed(50)

a = pd.DataFrame({'A':np.random.randn(50),
                  'B':np.random.randn(50),
                  'C':np.random.randn(50),
                  'D':np.random.randn(50),
                  'E':np.random.randn(50)})

b = pd.DataFrame({'test':np.random.randn(10)})

Reproducing Pearson correlation

# .ix was removed in pandas 1.0; .iloc does the same positional selection here
pear_result1 = a.iloc[:,0:5].corrwith(b.iloc[:,0])
print(pear_result1)
# A   -0.073506
# B   -0.098045
# C    0.166293
# D    0.123491
# E    0.348576
# dtype: float64

pear_result2 = a.apply(lambda col: col.corr(b.iloc[:,0], method='pearson'), axis=0)
print(pear_result2)
# A   -0.073506
# B   -0.098045
# C    0.166293
# D    0.123491
# E    0.348576
# dtype: float64

print(pear_result1 == pear_result2)
# A    True
# B    True
# C    True
# D    True
# E    True
# dtype: bool

Spearman correlation

spr_result = a.apply(lambda col: col.corr(b.iloc[:,0], method='spearman'), axis=0)
print(spr_result)
# A   -0.018182
# B   -0.103030
# C    0.321212
# D   -0.151515
# E    0.321212
# dtype: float64
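
As a possible shortcut, pandas 0.24 and later accept a method argument on DataFrame.corrwith itself, so the lambda may not be needed at all. The sketch below builds illustrative random frames of the same shape (not the same numbers as above, since the values are drawn in a different order) and checks that both routes agree. It uses .iloc on the assumption that a pandas version without .ix is installed.

```python
import numpy as np
import pandas as pd

np.random.seed(50)

# Illustrative frames shaped like the ones above: a has 50 rows, b has 10
a = pd.DataFrame(np.random.randn(50, 5), columns=list('ABCDE'))
b = pd.DataFrame({'test': np.random.randn(10)})

b_col = b.iloc[:, 0]

# Spearman via corrwith (the method argument needs pandas >= 0.24)
spr_corrwith = a.iloc[:, 0:5].corrwith(b_col, method='spearman')

# Spearman via apply + Series.corr, as in the answer
spr_apply = a.iloc[:, 0:5].apply(lambda col: col.corr(b_col, method='spearman'))

# Both routes align on the shared index (the first 10 rows) before ranking
print(np.allclose(spr_corrwith, spr_apply))
```

Either form should keep the original column selection intact, so the choice is mostly a matter of which pandas version is available.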

Spearman coefficient with pvalues

from scipy.stats import spearmanr

# spearmanr needs equal-length inputs, so keep only the rows of a
# whose index labels also appear in b (b has 10 rows, a has 50)
b_col = b.iloc[:, 0]
a_sub = a.loc[b_col.index]

# FULL RESULT PER COLUMN (correlation and p-value together)
spr_all_result = a_sub.apply(lambda col: spearmanr(col, b_col), axis=0)

# SERIES OF FLOATS
spr_corr = a_sub.apply(lambda col: spearmanr(col, b_col)[0], axis=0)
spr_pvalues = a_sub.apply(lambda col: spearmanr(col, b_col)[1], axis=0)
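
One way to keep the coefficient and the p-value together is to build a small table with one row per column of a. This is only a sketch: the helper name spearman_table is made up for illustration, and it assumes the two inputs should be matched on their shared index labels, since spearmanr expects inputs of equal length.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def spearman_table(frame, target):
    """Hypothetical helper: Spearman rho and p-value of each column vs target."""
    rows = {}
    for name in frame.columns:
        # Keep only rows present in both inputs and free of NaN
        pair = pd.concat([frame[name], target], axis=1, join='inner').dropna()
        rho, pval = spearmanr(pair.iloc[:, 0], pair.iloc[:, 1])
        rows[name] = {'rho': rho, 'pvalue': pval}
    return pd.DataFrame.from_dict(rows, orient='index')

np.random.seed(50)

# Illustrative frames shaped like the ones in the answer
a = pd.DataFrame(np.random.randn(50, 5), columns=list('ABCDE'))
b = pd.DataFrame({'test': np.random.randn(10)})

result = spearman_table(a, b.iloc[:, 0])
print(result)
```

Calling spearmanr once per column avoids computing the same statistic twice, which the two separate apply calls do.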

3 Comments

That is perfect. In fact, I can still apply my original column selection to the data frame. With your example it would look like this: (a.ix[:,0:5]).apply(lambda col: col.corr(b.ix[:,0], method='pearson'), axis=0). Thank you!
I just realized: is there an easy way to generate p-values here without having to use scipy.stats? And if I do have to use scipy.stats, do you know by any chance how I can apply the same framing you just worked out to the... Thanks.
Works great both ways, thanks! It seems I do not have enough reputation to upvote your answer, but I did click the check mark.
