Correlation between two non-numeric columns in a Pandas DataFrame

Question

I load my data from a database table into a pandas DataFrame with an SQL query. The data looks like:

        group  phone_brand
0      M32-38          小米
1      M32-38          小米
2      M32-38          小米
3      M29-31          小米
4      M29-31          小米
5      F24-26         OPPO
6      M32-38          酷派
7      M32-38          小米
8      M32-38         vivo
9      F33-42          三星
10     M29-31          华为
11     F33-42          华为
12     F27-28          三星
13     M32-38          华为
14       M39+         艾优尼
15     F27-28          华为
16     M32-38          小米
17     M32-38          小米
18       M39+          魅族
19     M32-38          小米
20     F33-42          三星
21     M23-26          小米
22     M23-26          华为
23     M27-28          三星
24     M29-31          小米
25     M32-38          三星
26     M32-38          三星
27     F33-42          三星
28     M32-38          三星
29     M32-38          三星
...       ...          ...
74809  M27-28          华为
74810  M29-31          TCL

Now I want to find the correlation and the frequencies for these two columns and visualize them with Matplotlib. I tried something like:

DataFrame.plot(style='o')
plt.show()

Now how can I visualize this correlation in the simplest way?

You must first encode the categories in each column as numbers (I don't know how the Chinese characters will be read, but serialization should help) and then look for correlation. A heatmap is a good way to visualize the correlation matrix. Find inspiration here: Heatmap
– skrubber, Commented Oct 29, 2017 at 16:00

piRSquared · Accepted Answer · 2017-10-29 16:52:46Z

20

To quickly get a correlation:

df.apply(lambda x: x.factorize()[0]).corr()

                group  phone_brand
group        1.000000     0.427941
phone_brand  0.427941     1.000000
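
As a minimal sketch of what the one-liner does, the snippet below rebuilds the first 11 rows shown in the question (the name `sample` is only for illustration) and factorizes each column. `factorize()` numbers the values in order of first appearance, so the codes, and therefore the coefficient, depend on row order rather than on any natural ordering of the categories.

```python
import pandas as pd

# First 11 rows from the question (index 0 to 10); `sample` is an illustrative name.
sample = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31", "F24-26",
              "M32-38", "M32-38", "M32-38", "F33-42", "M29-31"],
    "phone_brand": ["小米", "小米", "小米", "小米", "小米", "OPPO",
                    "酷派", "小米", "vivo", "三星", "华为"],
})

# factorize() returns (codes, uniques); keep only the integer codes per column.
codes = sample.apply(lambda col: col.factorize()[0])
print(codes)

# Pearson correlation computed on those integer codes.
corr = codes.corr()
print(corr)
```

The coefficient here is computed on 11 rows only, so it will not match the value above, which was obtained on the full data set.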

Heat map

import pandas as pd
import seaborn as sns

sns.heatmap(pd.crosstab(df.group, df.phone_brand))
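
The table that the heat map draws is the count of each (group, phone_brand) pair, which also covers the frequency part of the question. The sketch below builds it for the first 11 rows from the question and, as an assumption about what a Matplotlib-only version could look like, wraps the plotting in a hypothetical helper named `plot_crosstab_heatmap`. The names `sample` and `counts` are illustrative too.

```python
import pandas as pd

# First 11 rows from the question (index 0 to 10); `sample` is an illustrative name.
sample = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31", "F24-26",
              "M32-38", "M32-38", "M32-38", "F33-42", "M29-31"],
    "phone_brand": ["小米", "小米", "小米", "小米", "小米", "OPPO",
                    "酷派", "小米", "vivo", "三星", "华为"],
})

# Frequency table: one row per group, one column per brand, cells are counts.
counts = pd.crosstab(sample["group"], sample["phone_brand"])
print(counts)


def plot_crosstab_heatmap(table):
    """Hypothetical helper: draw a crosstab as a heat map with plain Matplotlib."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    image = ax.imshow(table.to_numpy(), aspect="auto")
    # Label each cell position with the category it belongs to.
    ax.set_xticks(range(table.shape[1]))
    ax.set_xticklabels(table.columns)
    ax.set_yticks(range(table.shape[0]))
    ax.set_yticklabels(table.index)
    ax.set_xlabel(table.columns.name)
    ax.set_ylabel(table.index.name)
    fig.colorbar(image, ax=ax, label="count")
    return fig, ax
```

Calling `plot_crosstab_heatmap(counts)` and then `matplotlib.pyplot.show()` should display the figure. Rendering the Chinese brand names may require configuring a font that contains CJK glyphs.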


A. Rehman · Answer · 2019-04-18 18:56:38Z

0

Use the pandas.factorize() method, which returns a numeric representation of an array by assigning an integer code to each distinct value.
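
As a small illustration, `pandas.factorize()` returns a pair: an integer code per element and the distinct values those codes index into. The values below are taken from the question's `group` column; the variable names are illustrative.

```python
import pandas as pd

groups = pd.Series(["M32-38", "M29-31", "F24-26", "M32-38", "F33-42"])

# Default behaviour: codes follow the order in which values first appear.
codes, uniques = pd.factorize(groups)
print(codes)
print(uniques)

# The codes are positions in `uniques`, so the original values can be recovered.
restored = uniques.take(codes)

# With sort=True the distinct values are sorted first and the codes follow that order.
sorted_codes, sorted_uniques = pd.factorize(groups, sort=True)
print(sorted_codes)
print(sorted_uniques)
```

Either numbering is an arbitrary labelling of the categories, which matters when the codes are later fed into a numeric measure such as `corr()`.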


HMS · Answer · 2022-08-04 20:12:19Z

Besides the method piRSquared explained very clearly, you can use scikit-learn's LabelEncoder, which maps each distinct value to an integer so that the columns can be passed to numeric routines such as corr().

# Import the label encoder
from sklearn.preprocessing import LabelEncoder

# Create the encoder; it is refit for each column below
le = LabelEncoder()

# Fit on each column and replace it with its integer codes
df['group'] = le.fit_transform(df['group'])
df['phone_brand'] = le.fit_transform(df['phone_brand'])

# Compute the correlation of the encoded columns
df.corr()

# Output for the first 11 rows (index 0 to 10)

               group  phone_brand
group        1.00000      0.67391
phone_brand  0.67391      1.00000

After applying LabelEncoder, the DataFrame is converted from this:

     group  phone_brand
0   M32-38          小米
1   M32-38          小米
2   M32-38          小米
3   M29-31          小米
4   M29-31          小米
5   F24-26         OPPO
6   M32-38          酷派
7   M32-38          小米
8   M32-38         vivo
9   F33-42          三星
10  M29-31          华为

to this

   group    phone_brand
0      3              4
1      3              4
2      3              4
3      2              4
4      2              4
5      0              0
6      3              5
7      3              4
8      3              1
9      1              2
10     2              3

For multiple columns, you can go through the answers.
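
For more than two columns, one possible sketch that stays within pandas (swapped in here for LabelEncoder, and not the method of the linked answers) converts every column to the `category` dtype and takes its codes. For string columns the categories are sorted, so the codes should coincide with the LabelEncoder table above. The frame `sample` holds the first 11 rows from the question, and all variable names are illustrative.

```python
import pandas as pd

# First 11 rows from the question (index 0 to 10); `sample` is an illustrative name.
sample = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31", "F24-26",
              "M32-38", "M32-38", "M32-38", "F33-42", "M29-31"],
    "phone_brand": ["小米", "小米", "小米", "小米", "小米", "OPPO",
                    "酷派", "小米", "vivo", "三星", "华为"],
})

# Encode every column at once: sorted categories, then their integer codes.
encoded = sample.astype("category").apply(lambda col: col.cat.codes)
print(encoded)

# Same rows, two different arbitrary numberings of the categories.
r_sorted = encoded.corr().loc["group", "phone_brand"]
r_appearance = (
    sample.apply(lambda col: col.factorize()[0]).corr().loc["group", "phone_brand"]
)
print(r_sorted, r_appearance)
```

On these 11 rows the sorted codes reproduce the 0.67391 shown above, while order-of-appearance codes from `factorize()` give a noticeably lower value. Because the integers are arbitrary labels for unordered categories, the coefficient reflects the chosen encoding as much as the data.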
