How to exclude certain columns of a pandas dataframe?

Question

I have the following dataset in a .csv file:

feature1, feature2, feature3, feature4
0, 42, 2, 1000
2, 13, ?, 997
1, 30, ?, 861
2, 29, ?, ?

I would like to create a pandas dataframe or a numpy array that leaves out any feature with more than x% of unknown data (where x is specified earlier in the code).

The question is not clear to me. Can you show the expected output?
– BENY, Commented Sep 28, 2017 at 2:44
For example, with 0% of missing data allowed, I would like to keep only feature1 and feature2 and their respective data in my pandas dataframe. With 25%, feature4 would also be included.
– ftoyoshima, Commented Sep 28, 2017 at 2:56
So, you're trying to replace all the ? with something? Is that your question?
– Joe T. Boka, Commented Sep 28, 2017 at 3:07
No, I have to exclude from my analysis the features that contain too many '?'.
– ftoyoshima, Commented Sep 28, 2017 at 3:14
Are they question marks or NaN values? This is important because the dataframe currently has mixed types.
– DJK, Commented Sep 28, 2017 at 3:18

BENY · Accepted Answer · 2017-09-28 03:11:00Z

4

Use replace and dropna (note that you need the thresh parameter of dropna).

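import pandas as pd
import numpy as np
# df is built under "Data Input" below; thresh is the minimum number of non-NA values a column needs to be kept
df.replace('?', np.nan).dropna(axis=1, thresh=0.75*len(df))  # for your example, we accept at most one NA per column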

Out[735]: 
   feature1  feature2  feature4
0         0         1     100.0
1         2         2     900.0
2         1         3     861.0
3         2         4       NaN

Data Input

df = pd.DataFrame({'feature1': [0,2,1,2], 'feature2': [1,2,3,4],'feature3':[2,'?','?','?'],'feature4':[100,900,861,'?']})
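
The same idea can be wrapped in a small helper that takes the allowed share of missing values as a parameter. This is only a sketch: the name `drop_sparse_columns`, its parameters, and the rounding with `math.ceil` are assumptions for illustration, not part of the answer above.

```python
import math

import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_frac, missing_marker="?"):
    """Return a copy of df without the columns that have too many missing values.

    Hypothetical helper: max_missing_frac is the largest share of missing
    values (0.0 to 1.0) a column may have and still be kept.
    """
    cleaned = df.replace(missing_marker, np.nan)
    # dropna(thresh=k) keeps a column only if it has at least k non-missing values.
    # Rounding up is an assumption; fractions that are not exactly representable
    # as floats may need extra care.
    min_present = math.ceil((1 - max_missing_frac) * len(cleaned))
    return cleaned.dropna(axis=1, thresh=min_present)


df = pd.DataFrame({
    "feature1": [0, 2, 1, 2],
    "feature2": [1, 2, 3, 4],
    "feature3": [2, "?", "?", "?"],
    "feature4": [100, 900, 861, "?"],
})

kept = drop_sparse_columns(df, 0.25)
print(list(kept.columns))
```

With a fraction of 0.25 the helper keeps feature4 (one missing value out of four); with 0.0 only the fully populated columns should remain.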

edited Sep 28, 2017 at 3:11

answered Sep 28, 2017 at 3:05

BENY


1 Comment

Joe T. Boka Over a year ago

Wow, it seems like this is kind of a bug in SO site. You posted an answer earlier, then deleted it. I kept the page open while I was working on my answer and kept checking if anyone answered already. But, the page didn't display your answer because, I guess, you "undeleted" your original answer. SO didn't show any new answers. So, I had no way of knowing that anyone answered already.

piRSquared · Accepted Answer · 2017-09-28 08:20:51Z

1

I'm going to assume those '?' are null values. If they aren't, do something like this:

df = df.apply(pd.to_numeric, errors='coerce')
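
As a quick illustration of what the coercion does, here is a minimal sketch on two made-up columns shaped like the question's data. Note that `errors='coerce'` turns every value that cannot be parsed as a number into NaN, not only '?', so it is worth checking that no other text values are lost.

```python
import pandas as pd

# Small illustrative frame; the values mirror feature3 and feature4 from the question.
df = pd.DataFrame({
    "feature3": [2, "?", "?", "?"],
    "feature4": [1000, 997, 861, "?"],
})

# Anything that cannot be parsed as a number becomes NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Number of missing values per column after coercion.
print(numeric.isnull().sum())
```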

Now, we can make a function that takes a dataframe and a threshold. What we want to do is use loc with a boolean series that tells us which columns have sufficient data representation.

drp = lambda d, x: d.loc[:, d.isnull().mean() < x]

drp(df, .5)

   feature1  feature2  feature4
0         0        42    1000.0
1         2        13     997.0
2         1        30     861.0
3         2        29       NaN

If you insist on leaving the '?' values in df as they are, mask them in a copy instead; any NaN values already present are counted as missing too.

d = df.mask(df.astype(object).eq('?'))

drp = lambda d, x: d.loc[:, d.isnull().mean() < x]

drp(d, .5)
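
For completeness, here is a sketch of the whole path from the .csv text to a filtered dataframe and a numpy array. It assumes the file looks exactly like the sample in the question (a space after each comma, '?' as the only missing marker); `keep_columns` is a hypothetical name, and it uses `<=` rather than `<` so that a threshold of 0.25 keeps feature4, as the asker's comment describes.

```python
import io

import pandas as pd

# The sample from the question, inlined so the sketch is self-contained.
csv_text = """feature1, feature2, feature3, feature4
0, 42, 2, 1000
2, 13, ?, 997
1, 30, ?, 861
2, 29, ?, ?
"""

# skipinitialspace drops the blank after each comma; na_values marks '?' as missing.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, na_values="?")


def keep_columns(frame, max_missing_frac):
    """Keep the columns whose share of missing values is at most max_missing_frac."""
    missing_share = frame.isnull().mean()
    return frame.loc[:, missing_share <= max_missing_frac]


filtered = keep_columns(df, 0.25)
array = filtered.to_numpy()  # numpy array, if that form is preferred
print(list(filtered.columns))
```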

edited Sep 28, 2017 at 8:20

answered Sep 28, 2017 at 5:57

piRSquared


1 Comment

Joe T. Boka Over a year ago

The OP actually said in his comment that the ? are not null values.

Joe T. Boka · Accepted Answer · 2017-09-28 05:23:50Z

0

This is probably the easiest way to solve it, if I understand your question correctly. You can change ? to NaN using np.nan, then use df.loc and df.isnull to select the columns you need.

df.replace(to_replace=r'\?', value=np.nan, inplace=True, regex=True)  # raw string avoids an invalid-escape warning
df = df.loc[:, (df.isnull().sum() <= len(df) / 4)]  # keep columns with at most 25% missing values
print(df)
   feature1  feature2  feature4
0         0        42      1000
1         2        13       997
2         1        30       861
3         2        29       NaN
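
One situation where the regex form helps is when the '?' cells carry stray whitespace, for example after reading the file without stripping the space that follows each comma. The sketch below assumes that situation; the anchored pattern and the final `pd.to_numeric` step are additions for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Assumed input: text columns where '?' comes with a leading space.
df = pd.DataFrame({
    "feature1": [0, 2, 1, 2],
    "feature2": [42, 13, 30, 29],
    "feature3": ["2", " ?", " ?", " ?"],
    "feature4": ["1000", "997", "861", " ?"],
})

# Replace cells that consist only of '?' plus optional surrounding whitespace.
cleaned = df.replace(to_replace=r"^\s*\?\s*$", value=np.nan, regex=True)

# Count missing values per column and keep columns with at most a quarter missing.
missing_counts = cleaned.isnull().sum()
max_missing = len(cleaned) // 4
result = cleaned.loc[:, missing_counts <= max_missing]

# Convert the remaining text columns to numbers.
result = result.apply(pd.to_numeric)
print(result)
```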

edited Sep 28, 2017 at 5:23

answered Sep 28, 2017 at 4:56

Joe T. Boka

