Conditionally replace values in pandas.DataFrame with previous value

Question

I need to filter outliers in a dataset. Replacing the outlier with the previous value in the column makes the most sense in my application.

I was having considerable difficulty doing this with the pandas tools available (mostly to do with copies on slices, or type conversions occurring when setting to NaN).

Is there a fast and/or memory efficient way to do this? (Please see my answer below for the solution I am currently using, which also has limitations.)

A simple example:

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,1000,6,7,8],'B':list('abcdefgh')})
>>> df
      A  B
0     1  a
1     2  b
2     3  c
3     4  d
4  1000  e # '1000  e' --> '4  e'
5     6  f
6     7  g
7     8  h

user3483203 · Accepted Answer · 2018-10-13 03:37:31Z

You can simply mask values over your threshold and use ffill:

df.assign(A=df.A.mask(df.A.gt(10)).ffill())

     A  B
0  1.0  a
1  2.0  b
2  3.0  c
3  4.0  d
4  4.0  e
5  6.0  f
6  7.0  g
7  8.0  h

Masking and then forward-filling, rather than using something like shift, is necessary because it guarantees a non-outlier result even when the previous value is itself above the threshold.
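To make the difference concrete, here is a small sketch comparing the two approaches on a column with two adjacent outliers. The frame `df_demo` and the threshold of 10 are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Hypothetical data with two consecutive outliers (1000, 2000).
df_demo = pd.DataFrame({'A': [1, 2, 1000, 2000, 5]})
is_outlier = df_demo['A'].gt(10)

# shift-based replacement: each outlier takes the value directly before it,
# so the second outlier inherits the first outlier (1000).
shifted = df_demo['A'].where(~is_outlier, df_demo['A'].shift())

# mask + ffill: outliers become NaN, then the last valid value is carried forward.
filled = df_demo['A'].mask(is_outlier).ffill()

print(shifted.tolist())  # [1.0, 2.0, 2.0, 1000.0, 5.0]
print(filled.tolist())   # [1.0, 2.0, 2.0, 2.0, 5.0]
```

Note that if the very first row is an outlier, ffill has nothing to carry forward and that row stays NaN.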

6 Comments

nivk Over a year ago

Thank you for this suggestion. However, this has the issue with implicit type conversion during the NaN insertion which I alluded to. Is it possible to avoid this?

user3483203 Over a year ago

You can avoid it by casting back to int, or potentially by using a numpy masked array.

nivk Over a year ago

Can you update with an example of the np masked array? I'd prefer not to always cast back to int, since the column may be float in other instances.

ALollz Over a year ago

@nivk df.assign(A=pd.to_numeric(df.A.mask(df.A.gt(10)).ffill(), downcast='integer')) should flexibly convert the types. You might even go from float to int if all of the floats are outliers and get replaced.

user3483203 Over a year ago

It happens because NaN is a float
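One way to address the dtype concern raised in these comments, without hard-coding int, might be to remember the column's original dtype and cast back to it after filling. This is only a sketch: the names `original_dtype`, `cleaned`, and `result` are illustrative, and it assumes no NaN survives the fill (a leading outlier would leave one, and casting NaN to an integer dtype raises an error). Very large integers may also lose precision on the round trip through float.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 1000, 6, 7, 8], 'B': list('abcdefgh')})

# Remember the dtype before masking introduces NaN (and therefore float).
original_dtype = df['A'].dtype

# Mask outliers and forward-fill; the intermediate result is float64.
cleaned = df['A'].mask(df['A'].gt(10)).ffill()

# Cast back to whatever the column was originally (int or float).
result = df.assign(A=cleaned.astype(original_dtype))

print(result['A'].tolist())  # [1, 2, 3, 4, 4, 6, 7, 8]
```

Because the cast targets the original dtype rather than int, a float column simply stays float.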

nivk · Accepted Answer · 2018-10-13 02:54:56Z

I circumvented some of the issues with pandas copies and slices by converting to a numpy array first, doing the operations there, and then re-inserting the column. I'm not certain, but as far as I can tell, the datatype is the same once it is put back into the pandas.DataFrame.

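import numpy as np

def df_replace_with_previous(df, col, maskfunc, inplace=False):
    arr = np.array(df[col])                    # copy of the column as a NumPy array
    mask = np.array(maskfunc(arr), dtype=bool)
    mask[:1] = False                           # the first row has no previous value
    # Shift the mask one position earlier so it selects each outlier's predecessor
    prev = np.append(mask[1:], False)
    arr[mask] = arr[prev]
    if inplace:
        df[col] = arr
        return
    else:
        df2 = df.copy()
        df2[col] = arr
        return df2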

This creates a mask, shifts it one position earlier so that the True values point at the previous entry, and overwrites each flagged value with that entry. Of course, this has to be run repeatedly if there are adjacent outliers (N times for a run of N consecutive outliers), which is not ideal.

Usage in the case given in OP:

df_replace_with_previous(df,'A',lambda x:x>10,False)
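The repeated passes for consecutive outliers can probably be avoided while staying in NumPy and never introducing NaN. The sketch below is one possible approach, not the method from this answer: it records, for each row, the index of the most recent non-outlier using a running maximum, then gathers values by that index. The function name `replace_with_last_valid` is hypothetical. It assumes that outliers at the very start of the column, which have no valid predecessor, should be left unchanged.

```python
import numpy as np
import pandas as pd

def replace_with_last_valid(df, col, maskfunc):
    """Return a copy of df with flagged values replaced by the last unflagged one."""
    arr = df[col].to_numpy().copy()
    mask = np.asarray(maskfunc(arr), dtype=bool)
    # Valid rows keep their own index; flagged rows get 0 as a placeholder.
    idx = np.where(mask, 0, np.arange(arr.size))
    # The running maximum turns each placeholder into the index of the most
    # recent valid row. Leading outliers map to index 0 and stay unchanged.
    idx = np.maximum.accumulate(idx)
    out = df.copy()
    out[col] = arr[idx]  # integer gather, so the dtype of arr is preserved
    return out

df_runs = pd.DataFrame({'A': [1, 2, 1000, 2000, 5, 6]})
fixed = replace_with_last_valid(df_runs, 'A', lambda x: x > 10)
print(fixed['A'].tolist())  # [1, 2, 2, 2, 5, 6]
```

Since no NaN is ever inserted, an int column stays int and a float column stays float.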
