3

I have the following dataset in a .csv file:

feature1, feature2, feature3, feature4
0, 42, 2, 1000
2, 13, ?, 997
1, 30, ?, 861
2, 29, ?, ?

I would like to create a pandas DataFrame or a NumPy array that excludes any feature with more than x% unknown data (where x is specified earlier in the code).

6
  • The question is not clear to me. Can you show the expected output? Commented Sep 28, 2017 at 2:44
  • For example, with 0% missing data allowed, I would like to keep only feature1 and feature2 and their respective data in my pandas dataframe. With 25% allowed, feature4 would also be included. Commented Sep 28, 2017 at 2:56
  • So, you're trying to replace all the ? with something? Is that your question? Commented Sep 28, 2017 at 3:07
  • No, I have to exclude the features in which there's too much '?' from my analysis. Commented Sep 28, 2017 at 3:14
  • Are they question marks or NaN values? This is important because the dataframe currently has mixed types. Commented Sep 28, 2017 at 3:18

3 Answers

4

Use replace and dropna (note that you need the thresh parameter of dropna):

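import pandas as pd
import numpy as np
# thresh is the minimum number of non-NA values a column needs to be kept;
# for your example (4 rows) we accept at most one NA per column
df.replace('?', np.nan).dropna(axis=1, thresh=int(0.75 * len(df)))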

Out[735]: 
   feature1  feature2  feature4
0         0         1     100.0
1         2         2     900.0
2         1         3     861.0
3         2         4       NaN

Data Input

df = pd.DataFrame({'feature1': [0,2,1,2], 'feature2': [1,2,3,4],'feature3':[2,'?','?','?'],'feature4':[100,900,861,'?']})
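
To turn an allowed fraction of missing data into the thresh argument, something like the following sketch may help. The helper name drop_sparse_columns and its parameters are made up for illustration, and it assumes missing values are marked with the literal string '?'. The data mirrors the question's CSV rather than the frame above.

```python
import math

import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_frac, missing_marker="?"):
    """Drop columns whose share of missing values exceeds max_missing_frac.

    Hypothetical helper: the name and signature are illustrative only.
    """
    cleaned = df.replace(missing_marker, np.nan)
    # dropna's thresh is the minimum number of non-missing values a column
    # must have to be kept, so convert the allowed missing share to a count.
    # round() guards against tiny floating-point errors before ceil().
    min_present = math.ceil(round((1 - max_missing_frac) * len(cleaned), 9))
    return cleaned.dropna(axis=1, thresh=min_present)


df = pd.DataFrame({
    "feature1": [0, 2, 1, 2],
    "feature2": [42, 13, 30, 29],
    "feature3": [2, "?", "?", "?"],
    "feature4": [1000, 997, 861, "?"],
})

print(list(drop_sparse_columns(df, 0.0).columns))   # ['feature1', 'feature2']
print(list(drop_sparse_columns(df, 0.25).columns))  # ['feature1', 'feature2', 'feature4']
```

Going through a count means the boundary case (exactly 25% missing with a 25% allowance) is kept, which matches what the question asks for.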

1 Comment

Wow, it seems like this is kind of a bug in SO site. You posted an answer earlier, then deleted it. I kept the page open while I was working on my answer and kept checking if anyone answered already. But, the page didn't display your answer because, I guess, you "undeleted" your original answer. SO didn't show any new answers. So, I had no way of knowing that anyone answered already.
1

I'm going to assume those '?' are null values. If they aren't, do something like this:

df = df.apply(pd.to_numeric, errors='coerce')

Now, we can make a function that takes a dataframe and a threshold. What we want to do is use loc with a boolean series that tells us which columns have sufficient data representation.

drp = lambda d, x: d.loc[:, d.isnull().mean() < x]

drp(df, .5)

   feature1  feature2  feature4
0         0        42    1000.0
1         2        13     997.0
2         1        30     861.0
3         2        29       NaN
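
One thing to watch is the boundary: with a strict `<`, a column that has exactly x missing is dropped, whereas the question wants feature4 (25% missing) kept at x = 0.25. A minimal sketch with `<=` instead, where the function name keep_columns is just an illustrative choice and the data is copied from the question:

```python
import pandas as pd


def keep_columns(df, max_missing_frac):
    """Keep columns whose missing share is at most max_missing_frac.

    Hypothetical helper; assumes every column is meant to be numeric.
    """
    # Anything that cannot be parsed as a number (such as '?') becomes NaN.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    # The mean of a boolean mask is the fraction of missing values per column.
    missing_frac = numeric.isnull().mean()
    return numeric.loc[:, missing_frac <= max_missing_frac]


df = pd.DataFrame({
    "feature1": [0, 2, 1, 2],
    "feature2": [42, 13, 30, 29],
    "feature3": [2, "?", "?", "?"],
    "feature4": [1000, 997, 861, "?"],
})

print(list(keep_columns(df, 0.0).columns))   # ['feature1', 'feature2']
print(list(keep_columns(df, 0.25).columns))  # ['feature1', 'feature2', 'feature4']
```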

If you would rather not coerce everything to numeric, you can mask only the '?' entries; any NaN values already present are counted as well:

d = df.mask(df.astype(object).eq('?'))

drp = lambda d, x: d.loc[:, d.isnull().mean() < x]

drp(d, .5)

1 Comment

The OP actually said in his comment that the ? are not null values.
0

This is probably the easiest way to solve it, if I understand your question correctly. You can change ? to NaN using np.nan, then use df.loc and df.isnull to select the columns you need.

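# raw string so that \? reaches the regex engine as a literal question mark
df.replace(to_replace=r'\?', value=np.nan, inplace=True, regex=True)
# keep columns where at most a quarter of the rows are missing
df = df.loc[:, df.isnull().sum() <= len(df) / 4]
print(df)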
   feature1  feature2  feature4
0         0        42      1000
1         2        13       997
2         1        30       861
3         2        29       NaN
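
If the data comes straight from the .csv file, it may be simpler to let read_csv do the conversion. This is only a sketch: it uses an in-memory copy of the question's data in place of the real file, and assumes '?' is the only missing-value marker.

```python
import io

import pandas as pd

# Stand-in for the real file; the contents are copied from the question.
csv_text = """feature1, feature2, feature3, feature4
0, 42, 2, 1000
2, 13, ?, 997
1, 30, ?, 861
2, 29, ?, ?
"""

# na_values turns '?' into NaN while parsing; skipinitialspace drops the
# blank that follows each comma so that ' ?' is recognised as '?'.
df = pd.read_csv(io.StringIO(csv_text), na_values="?", skipinitialspace=True)
# Strip stray whitespace from the header names (harmless if already clean).
df.columns = df.columns.str.strip()

# Keep columns where at most a quarter of the rows are missing.
kept = df.loc[:, df.isnull().sum() <= len(df) / 4]
arr = kept.to_numpy()  # NumPy array, if that is preferred over a DataFrame

print(list(kept.columns))  # ['feature1', 'feature2', 'feature4']
print(arr.shape)           # (4, 3)
```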
