I would like to perform a correlation test in Python (equivalent to corr.test(x,y) in R).

My input is a pandas DataFrame that looks something like the following:

df1:

  Column1  Column2  Column3     Column4  Column5  Column6
0 ab1      bc1      6.843147    NaN      5.12     NaN
1 ab2      ab5      NaN         5.6789   6.666    54.72
2 ab3      bc4      11.45       NaN      12.765   5.12
3 ab4      ab5      328.880123  NaN      0.50     88.44
4 ab5      ab1      72.142790   55.89    NaN      18.12

How do I compute the correlations between the numeric columns (Column3 through Column6)?

Note: There are more than 50 columns for correlation in the original data.

1 Answer

Use DataFrame.corr() to get the correlation matrix for the whole DataFrame:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

Or correlate one pair of columns at a time (remembering that each column is a Series) with Series.corr():

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.corr.html

For example, given your data above, the correlation between Column5 and Column6 is given by:

In [10]: df
Out[10]:
  Column1 Column2     Column3  Column4  Column5  Column6
0     ab1     bc1    6.843147      NaN    5.120      NaN
1     ab2     ab5         NaN   5.6789    6.666    54.72
2     ab3     bc4   11.450000      NaN   12.765     5.12
3     ab4     ab5  328.880123      NaN    0.500    88.44
4     ab5     ab1   72.142790  55.8900      NaN    18.12

In [11]: df.loc[:,'Column5'].corr(df.loc[:,'Column6'])
Out[11]: -0.9936504010065057

Or loop over every pair of numeric columns (not the most elegant approach, but it works):

In [12]: num = df.select_dtypes('number')  # skip the string columns
    ...: for c1 in num.columns[:-1]:
    ...:   for c2 in num.loc[:, c1:].columns:
    ...:     if c2 != c1:
    ...:       print('Correlation', c1, c2, '=', num[c1].corr(num[c2]))
    ...:
...function_base.py:2551: RuntimeWarning: Degrees of freedom <= 0 for slice 
    c = cov(x, y, rowvar)
...function_base.py:2480: RuntimeWarning: divide by zero encountered in true_divide 
    c *= np.true_divide(1, fact)

Correlation Column3 Column4 = nan
Correlation Column3 Column5 = -0.779129
Correlation Column3 Column6 = 0.999368
Correlation Column4 Column5 = nan
Correlation Column4 Column6 = -1.000000
Correlation Column5 Column6 = -0.993650

For an entire correlation matrix:

In [36]: df
Out[36]:
  Column1 Column2     Column3  Column4  Column5  Column6
0     ab1     bc1    6.843147      NaN    5.120      NaN
1     ab2     ab5         NaN   5.6789    6.666    54.72
2     ab3     bc4   11.450000      NaN   12.765     5.12
3     ab4     ab5  328.880123      NaN    0.500    88.44
4     ab5     ab1   72.142790  55.8900      NaN    18.12

In [37]: df.corr()
Out[37]:
          Column3  Column4   Column5   Column6
Column3  1.000000      NaN -0.779129  0.999368
Column4       NaN      1.0       NaN -1.000000
Column5 -0.779129      NaN  1.000000 -0.993650
Column6  0.999368     -1.0 -0.993650  1.000000

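Notice that in the matrix returned by DataFrame.corr(), the intersection of any two columns shows the same correlation that Series.corr() produced while looping through the columns. Each coefficient is computed only from the rows where both columns are non-NaN, which is why pairs with fewer than two overlapping rows come out as NaN. The DataFrame.corr() approach is simpler code-wise because you don't have to write your own loops. On pandas 2.0 and later, call df.corr(numeric_only=True) so that the string columns are skipped rather than raising an error.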
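With more than 50 columns, as in the question, it may help to select the numeric block explicitly and to require a minimum number of overlapping rows per pair. The following is a minimal sketch rather than a definitive recipe: it rebuilds the example data, and `min_periods=3` is an arbitrary threshold chosen for illustration.

```python
import numpy as np
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    "Column1": ["ab1", "ab2", "ab3", "ab4", "ab5"],
    "Column2": ["bc1", "ab5", "bc4", "ab5", "ab1"],
    "Column3": [6.843147, np.nan, 11.45, 328.880123, 72.142790],
    "Column4": [np.nan, 5.6789, np.nan, np.nan, 55.89],
    "Column5": [5.12, 6.666, 12.765, 0.50, np.nan],
    "Column6": [np.nan, 54.72, 5.12, 88.44, 18.12],
})

# Keep only numeric columns so the string columns cannot cause errors
numeric = df.select_dtypes(include="number")

# min_periods=3 (an assumed threshold) turns any pair with fewer than
# three overlapping non-NaN rows into NaN instead of reporting a
# coefficient based on only two points
corr_matrix = numeric.corr(method="pearson", min_periods=3)
print(corr_matrix)
```

With this threshold, the Column4/Column6 entry becomes NaN, because those two columns share only two complete rows and any two points are always perfectly correlated.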

P.S. I just realized you also want the p-value (not just the correlation coefficient), since the R function cor.test() returns both the coefficient and its significance. I'm not sure how to do that with pandas alone. I poked around and found a page that states, about halfway down, "Pandas does not have a function that calculates p-values, so it is better to use SciPy to calculate correlation as it will give you both p-value and correlation coefficient," and then shows how to do that.
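As a rough sketch of the SciPy route, `scipy.stats.pearsonr` returns both the coefficient and a two-sided p-value. The helper name `corr_test` and the `min_obs=3` cutoff below are assumptions made for this illustration, not part of pandas or SciPy. Rows with a NaN in either column are dropped per pair before the test.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Example data from the question
df = pd.DataFrame({
    "Column1": ["ab1", "ab2", "ab3", "ab4", "ab5"],
    "Column2": ["bc1", "ab5", "bc4", "ab5", "ab1"],
    "Column3": [6.843147, np.nan, 11.45, 328.880123, 72.142790],
    "Column4": [np.nan, 5.6789, np.nan, np.nan, 55.89],
    "Column5": [5.12, 6.666, 12.765, 0.50, np.nan],
    "Column6": [np.nan, 54.72, 5.12, 88.44, 18.12],
})


def corr_test(x, y, min_obs=3):
    """Hypothetical helper: Pearson r, two-sided p-value, and pair count.

    Rows where either value is NaN are dropped first, because
    scipy.stats.pearsonr does not drop missing values on its own.
    """
    pair = pd.concat([x, y], axis=1).dropna()
    n = len(pair)
    if n < min_obs:  # too few overlapping rows for a meaningful test
        return np.nan, np.nan, n
    r, p = stats.pearsonr(pair.iloc[:, 0], pair.iloc[:, 1])
    return float(r), float(p), n


# Test every pair of numeric columns and collect the results in a table
numeric = df.select_dtypes(include="number")
rows = []
for c1, c2 in combinations(numeric.columns, 2):
    r, p, n = corr_test(numeric[c1], numeric[c2])
    rows.append({"x": c1, "y": c2, "n": n, "r": r, "p_value": p})

results = pd.DataFrame(rows)
print(results)
```

Each row of `results` reports how many complete pairs went into the test, which matters here because most column pairs in the example overlap on three rows or fewer.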

3 Comments

Thanks for the information. Could you please give a more detailed or specific example? Thanks again!
@SucharitaMuthuswamy - I have edited the answer to include some examples using your data. Hope that helps.
Thank you very much! I was trying to use Pingouin (github.com/raphaelvallat/pingouin), since its output is similar to R's. But if I divide the df into x and y, I always get an error saying AttributeError: 'str' object has no attribute '_get_numeric_data'.