I have a Pandas data frame and wish to demean each of the numeric columns, leaving the categorical column unchanged. By "demean" I mean subtracting from each entry the mean of all entries in its column.

The data frame comes from 569 patients in the Wisconsin Breast Cancer directory, listing for each patient 10 numeric measurements, along with a diagnosis of M (malignant) or B (benign).

import pandas as pd

df = pd.read_csv('data/UWbcd.csv')
%load_ext google.colab.data_table  # just for purposes of browsing the data
df - df.mean()

Using this method, the entries in each numeric column are demeaned fine, but the categorical variables,

df['Diagnosis']

all become NaN.

Is there an efficient way to leave categorical variables alone when demeaning?

  • df.apply(lambda s: s - s.mean() if pd.api.types.is_numeric_dtype(s) else s) Commented Oct 22, 2020 at 1:31

1 Answer

I would do something like the following. First, create a list of the columns you want to demean.

numerical_cols = ['col1', 'col2', 'col5']

Then you can use loc to select only those columns. You can assign the result back into the current data frame or to a new one.

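# Option 1: overwrite the numeric columns in the current data frame
df.loc[:, numerical_cols] = df.loc[:, numerical_cols] - df.loc[:, numerical_cols].mean()

# Option 2: store the demeaned numeric columns in a new data frame
df_demean = df.loc[:, numerical_cols] - df.loc[:, numerical_cols].mean()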
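
If listing the columns by hand is impractical, the same idea can be sketched with select_dtypes, which picks out the numeric columns automatically. This is only a minimal sketch: the small data frame and its column names (Radius, Texture) are hypothetical stand-ins for the real data, and only Diagnosis comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for the real data; "Radius" and "Texture" are
# illustrative column names, not taken from the actual CSV file.
df = pd.DataFrame({
    "Radius": [10.0, 12.0, 14.0],
    "Texture": [1.0, 2.0, 6.0],
    "Diagnosis": ["M", "B", "B"],
})

# Select the numeric columns automatically instead of listing them by hand.
numerical_cols = df.select_dtypes(include="number").columns

# Work on a copy so the original data frame is left untouched.
df_demean = df.copy()
df_demean[numerical_cols] = df[numerical_cols] - df[numerical_cols].mean()

print(df_demean)
```

Here the Diagnosis column is carried over as-is, while each numeric column ends up with a mean of zero.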
