How to demean only numeric columns in a Pandas dataframe containing categorical variables?

Question

I have a Pandas data frame and wish to demean each of the numeric columns, leaving the categorical variable column entries unchanged. By "demean" I simply wish to subtract from each column entry the mean of all entries in the corresponding column.

The data frame comes from 569 patients in the Wisconsin Breast Cancer directory, listing for each patient 10 numeric measurements, along with a diagnosis of M (malignant) or B (benign).

import pandas as pd

df = pd.read_csv('data/UWbcd.csv')
%load_ext google.colab.data_table  # just for purposes of browsing the data
df - df.mean()

Using this method, the entries in each numeric column are demeaned fine, but the categorical variables,

df['Diagnosis']

all become NaN.

Is there an efficient way to leave categorical variables alone when demeaning?

df.apply(lambda s: s - s.mean() if pd.api.types.is_numeric_dtype(s) else s)
– Abdou, Commented Oct 22, 2020 at 1:31
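
The per-column check suggested in the comment above can be tried on a small made-up frame. This is only a sketch: the column names and values are invented for illustration, and the dtype test uses `pandas.api.types.is_numeric_dtype` because the `np.int` and `np.float` aliases were removed in NumPy 1.24.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up stand-in for the real data: two numeric columns and one label column.
toy = pd.DataFrame({
    "Radius": [10.0, 14.0, 18.0],
    "Texture": [1, 2, 6],
    "Diagnosis": ["B", "M", "M"],
})

# Subtract the column mean from numeric columns; return other columns untouched.
demeaned = toy.apply(lambda s: s - s.mean() if is_numeric_dtype(s) else s)
print(demeaned)
```

Note that `is_numeric_dtype` also treats boolean columns as numeric, so a frame with boolean flags may need a stricter test.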

Kirk · Accepted Answer · 2020-10-22 01:28:50Z

I would do something like the following: create a list of the columns you want to demean.

numerical_cols = ['col1', 'col2', 'col5']
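
If typing the names by hand is impractical, the list could probably be built from the dtypes instead. A minimal sketch, assuming the numeric columns are stored as integer or float dtypes; `example_df` and its column names are placeholders, not the real data:

```python
import pandas as pd

# Hypothetical frame; the column names are placeholders.
example_df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0],
    "col2": [10, 20, 30],
    "label": ["B", "M", "B"],
})

# select_dtypes(include="number") keeps only the integer and float columns.
numerical_cols = example_df.select_dtypes(include="number").columns.tolist()
print(numerical_cols)
```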

Then you can use loc to select only those columns; the result can be assigned back into the current data frame or to a new one.

df.loc[:, numerical_cols] = df.loc[:, numerical_cols] - df.loc[:, numerical_cols].mean()

df_demean = df.loc[:, numerical_cols] - df.loc[:, numerical_cols].mean()
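
As a rough end-to-end check, the same idea can be run on a few made-up rows. The values and the `Radius` and `Area` column names are invented for illustration; only the `Diagnosis` column comes from the question. The sketch keeps the categorical column by copying the frame first and overwriting just the numeric columns.

```python
import pandas as pd

# Made-up rows standing in for the real file; only the column layout matters.
df = pd.DataFrame({
    "Diagnosis": ["M", "B", "B", "M"],
    "Radius": [20.0, 12.0, 11.0, 17.0],
    "Area": [1200, 450, 380, 970],
})

numerical_cols = ["Radius", "Area"]

# Copy first so the original frame stays intact, then overwrite only the
# numeric columns with their demeaned values.
df_demean = df.copy()
df_demean[numerical_cols] = df[numerical_cols] - df[numerical_cols].mean()

print(df_demean)
```

Plain bracket assignment replaces the columns outright, which lets the integer `Area` column become float. Assigning through `.loc[:, numerical_cols]` writes into the existing columns instead and may warn about the dtype change in recent pandas versions.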
