12

I load my data from a SQL query into a pandas DataFrame. The data looks like this:

        group  phone_brand
0      M32-38          小米
1      M32-38          小米
2      M32-38          小米
3      M29-31          小米
4      M29-31          小米
5      F24-26         OPPO
6      M32-38          酷派
7      M32-38          小米
8      M32-38         vivo
9      F33-42          三星
10     M29-31          华为
11     F33-42          华为
12     F27-28          三星
13     M32-38          华为
14       M39+         艾优尼
15     F27-28          华为
16     M32-38          小米
17     M32-38          小米
18       M39+          魅族
19     M32-38          小米
20     F33-42          三星
21     M23-26          小米
22     M23-26          华为
23     M27-28          三星
24     M29-31          小米
25     M32-38          三星
26     M32-38          三星
27     F33-42          三星
28     M32-38          三星
29     M32-38          三星
...       ...          ...
74809  M27-28          华为
74810  M29-31          TCL

Now I want to find the correlation between these two columns, along with their frequencies, and visualize the result with Matplotlib. I tried something like:

DataFrame.plot(style='o')
plt.show() 

Now how can I visualize this correlation in the simplest way?

1
  • You must first encode the categories in each column as numbers, then look for the correlation. I don't know how the Chinese characters will be read, but serialization should help. A heatmap is a good way to visualize the correlation matrix; for inspiration, see: Heatmap. Commented Oct 29, 2017 at 16:00

3 Answers

20

To quickly get a correlation:

df.apply(lambda x: x.factorize()[0]).corr()

                group  phone_brand
group        1.000000     0.427941
phone_brand  0.427941     1.000000

Heat map

import pandas as pd
import seaborn as sns

# Heat map of how often each brand occurs within each group
sns.heatmap(pd.crosstab(df.group, df.phone_brand))
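Since the question asks for Matplotlib specifically, the same frequency table can also be drawn without seaborn. The following is a minimal sketch, not the answer author's code: it assumes a small hand-typed sample of the question's rows in place of the full `df`, and the axis and colorbar labels are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small sample typed in from the question's data (assumption: stands in for the full df).
df = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M29-31", "F24-26", "M32-38", "F33-42"],
    "phone_brand": ["小米", "小米", "小米", "OPPO", "酷派", "三星"],
})

# Frequency table: rows are groups, columns are brands, cells are counts.
counts = pd.crosstab(df["group"], df["phone_brand"])

# Draw the counts as a colored grid with one tick per category.
fig, ax = plt.subplots()
image = ax.imshow(counts.to_numpy(), aspect="auto")
ax.set_xticks(range(counts.shape[1]))
ax.set_xticklabels(counts.columns)
ax.set_yticks(range(counts.shape[0]))
ax.set_yticklabels(counts.index)
ax.set_xlabel("phone_brand")
ax.set_ylabel("group")
fig.colorbar(image, ax=ax, label="count")
# Call plt.show() or fig.savefig("heatmap.png") to view the figure.
```

The Chinese brand names only render if Matplotlib is configured with a font that contains CJK glyphs; otherwise the tick labels appear as empty boxes.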


0

Use the pandas.factorize() method, which returns a numeric representation of an array by assigning an integer code to each distinct value.
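As a rough illustration of what factorize returns, here is a short sketch on a hand-typed sample of the question's rows (the sample and the variable names are assumptions made for this example). Note that the codes follow the order of first appearance, so the Pearson coefficient computed from them depends on that arbitrary ordering.

```python
import pandas as pd

# Small sample typed in from the first rows of the question's data.
df = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M29-31", "F24-26", "M32-38"],
    "phone_brand": ["小米", "小米", "小米", "OPPO", "酷派"],
})

# factorize returns (codes, uniques); codes are assigned in order of first appearance.
codes, uniques = pd.factorize(df["group"])
print(codes.tolist())  # [0, 0, 1, 2, 0]
print(list(uniques))   # ['M32-38', 'M29-31', 'F24-26']

# Encode every column, then compute the Pearson correlation of the integer codes.
encoded = df.apply(lambda col: pd.factorize(col)[0])
corr = encoded.corr()
print(corr)
```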


0

Apart from the method piRSquared explained very clearly, you can use scikit-learn's LabelEncoder, which converts the values of a column to integer codes so that numeric methods such as corr() can work with them.

# Import the label encoder
from sklearn.preprocessing import LabelEncoder

# Create the label encoder object
le = LabelEncoder()

# Fit the encoder on each column and replace the column with its integer codes
df['group'] = le.fit_transform(df['group'])
df['phone_brand'] = le.fit_transform(df['phone_brand'])

# Compute the correlation matrix
df.corr()

# Output for the sample rows 0 through 10 shown below

               group  phone_brand
group        1.00000      0.67391
phone_brand  0.67391      1.00000

After applying LabelEncoder, the DataFrame is converted from this

     group  phone_brand
0   M32-38          小米
1   M32-38          小米
2   M32-38          小米
3   M29-31          小米
4   M29-31          小米
5   F24-26         OPPO
6   M32-38          酷派
7   M32-38          小米
8   M32-38         vivo
9   F33-42          三星
10  M29-31          华为

to this

   group    phone_brand
0      3              4
1      3              4
2      3              4
3      2              4
4      2              4
5      0              0
6      3              5
7      3              4
8      3              1
9      1              2
10     2              3

To encode multiple columns at once, you can go through the linked answers.
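One possible way to handle several columns, sketched here under assumptions rather than taken from the linked answers, is to keep a separate LabelEncoder per column so that each set of codes can later be mapped back to its labels. The `encoders` dictionary and the hand-typed sample are hypothetical names and data introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small sample typed in from the question's data.
df = pd.DataFrame({
    "group": ["M32-38", "M29-31", "F24-26", "M32-38", "F33-42"],
    "phone_brand": ["小米", "小米", "OPPO", "酷派", "三星"],
})

# Keep one fitted encoder per column (hypothetical helper dict).
encoders = {}
encoded = df.copy()
for column in ["group", "phone_brand"]:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(df[column])

# LabelEncoder assigns codes in sorted label order, so F24-26 -> 0, ..., M32-38 -> 3.
print(encoded)

# Each stored encoder can translate its codes back to the original labels.
original_groups = encoders["group"].inverse_transform(encoded["group"])
```

Working on a copy leaves the original text columns in `df` intact, which is convenient when the labels are still needed for axis ticks in a plot.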
