2

I need to filter outliers in a dataset. Replacing the outlier with the previous value in the column makes the most sense in my application.

I was having considerable difficulty doing this with the pandas tools available (mostly to do with copies on slices, or type conversions occurring when setting to NaN).

Is there a fast and/or memory efficient way to do this? (Please see my answer below for the solution I am currently using, which also has limitations.)


A simple example:

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,1000,6,7,8],'B':list('abcdefgh')})
>>> df
      A  B
0     1  a
1     2  b
2     3  c
3     4  d
4  1000  e # '1000  e' --> '4  e'
5     6  f
6     7  g
7     8  h

2 Answers 2

2

You can simply mask values over your threshold and use ffill:

df.assign(A=df.A.mask(df.A.gt(10)).ffill())

     A  B
0  1.0  a
1  2.0  b
2  3.0  c
3  4.0  d
4  4.0  e
5  6.0  f
6  7.0  g
7  8.0  h

Using mask is necessary rather than something like shift, because it guarantees non-outlier output in the case that the previous value is also above a threshold.

Sign up to request clarification or add additional context in comments.

6 Comments

Thank you for this suggestion. However, this has the issue with implicit type conversion during the NaN insertion which I alluded to. Is it possible to avoid this?
You can avoid by casting back to int, or using a numpy masked array potentially
Can you update with an example of the np masked array? I'd prefer not to always cast back to int, since the column may be float in other instances.
@nivk df.assign(A=pd.to_numeric(df.A.mask(df.A.gt(10)).ffill(), downcast='integer')) should flexibly convert the types. You might even go from float to int if all of the floats are outliers and get replaced.
It happens because NaN is a float
|
0

I circumvented some of the issues with pandas copies and slices by converting to a numpy array first, doing the operations there, and then re-inserting the column. I'm not certain, but as far as I can tell, the datatype is the same once it is put back into the pandas.DataFrame.

def df_replace_with_previous(df,col,maskfunc,inplace=False):
    arr = np.array(df[col])
    mask = maskfunc(arr)
    arr[ mask ] = arr[ list(mask)[1:]+[False] ]
    if inplace:
        df[col] = arr
        return
    else:
        df2 = df.copy()
        df2[col] = arr
        return df2

This creates a mask, shifts it down by one so that the True values point at the previous entry, and updates the array. Of course, this will need to run recursively if there are multiple adjacent outliers (N times if there are N consecutive outliers), which is not ideal.

Usage in the case given in OP:

df_replace_with_previous(df,'A',lambda x:x>10,False)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.