
I'm using a pandas DataFrame and applying a trend to the earliest data point to fill in missing historical data as best as possible. I know iterating over a pandas DataFrame is generally discouraged, but I haven't found an alternative way to do this, because each new value depends on the next value. If anyone knows a better way to achieve this, that would be great!

Example df:

   Week no  Data  Trend
0        1   0.0    1.5
1        2   0.0    1.5
2        3   0.0    1.0
3        4   0.0    0.5
4        5  10.0    0.6
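
For reference, the example df can be rebuilt like this (just a sketch of the frame shown above):

import pandas as pd

# Rebuild the example frame shown above.
df = pd.DataFrame({
    'Week no': [1, 2, 3, 4, 5],
    'Data': [0.0, 0.0, 0.0, 0.0, 10.0],
    'Trend': [1.5, 1.5, 1.0, 0.5, 0.6],
})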

The code I am currently using:

import math

# Iterate backwards: when a row has data and the previous row does not,
# back-fill the previous row as data * trend.
for wk in range(len(df)-1, 0, -1):
    if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
            and not math.isnan(df.loc[wk, 'Trend'])):
        df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
                                * df.loc[wk, 'Trend'])

The result:

   Week no  Data  Trend
0        1   4.5    1.5
1        2   3.0    1.5
2        3   3.0    1.0
3        4   6.0    0.5
4        5  10.0    0.6

1 Answer


Recursive calculations like this are not vectorisable; numba is used to improve performance:

import numpy as np
from numba import jit

# Compiled with numba so the backward loop runs at native speed.
@jit(nopython=True)
def f(a, b):
    # Walk backwards: fill a[i-1] from a[i] * b[i] when the previous
    # value is missing (0) and the trend is not NaN.
    for i in range(a.shape[0]-1, 0, -1):
        if (a[i] != 0) and (a[i-1] == 0) and not np.isnan(b[i]):
            a[i-1] = a[i] * b[i]
    return a

df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
print (df)

   Week no  Data  Trend
0        1   4.5    1.5
1        2   3.0    1.5
2        3   3.0    1.0
3        4   6.0    0.5
4        5  10.0    0.6
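
If numba is not available, the same function also works as plain Python over the NumPy arrays - much slower than the jitted version, but still without the .loc overhead of the original loop; a minimal sketch (f_plain is just an illustrative name):

import numpy as np

def f_plain(a, b):
    # Same backward fill as f(), only without the @jit decorator.
    for i in range(a.shape[0]-1, 0, -1):
        if (a[i] != 0) and (a[i-1] == 0) and not np.isnan(b[i]):
            a[i-1] = a[i] * b[i]
    return a

df['Data'] = f_plain(df['Data'].to_numpy(), df['Trend'].to_numpy())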

First, a test with no missing values, like the data in the sample:

df = pd.concat([df] * 40, ignore_index=True)
print (df)
     Week  no  Data  Trend
0       0   1   4.5    1.5
1       1   2   3.0    1.5
2       2   3   3.0    1.0
3       3   4   6.0    0.5
4       4   5  10.0    0.6
..    ...  ..   ...    ...
195     0   1   4.5    1.5
196     1   2   3.0    1.5
197     2   3   3.0    1.0
198     3   4   6.0    0.5
199     4   5  10.0    0.6

[200 rows x 4 columns]

In [114]: %%timeit
     ...: df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
     ...: 
     ...: 
121 µs ± 2.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

df = pd.concat([df] * 40, ignore_index=True)

print (df.shape)
(200, 4)


In [115]: %%timeit
     ...: for wk in range(len(df)-1, 0, -1):
     ...:         if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
     ...:                 and not math.isnan(df.loc[wk, 'Trend'])):
     ...:             df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
     ...:                                           *df.loc[wk, 'Trend'])
     ...:                                           
3.3 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
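
The timings above are from IPython's %%timeit; outside IPython, roughly the same measurement can be taken with the standard timeit module (a sketch, assuming df and f are already defined as above; absolute numbers will vary by machine):

import timeit

# Rough equivalent of %%timeit in a plain script.
t = timeit.timeit(
    "df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())",
    globals=globals(), number=10_000)
print(f'{t / 10_000 * 1e6:.1f} µs per loop')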

Then a test with 2 * 40 missing values; performance is similar:

print (df)
   Week  no  Data  Trend
0     0   1   0.0    NaN
1     1   2   0.0    NaN
2     2   3   0.0    1.0
3     3   4   0.0    0.5
4     4   5  10.0    0.6
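
For completeness, this test frame can be rebuilt like so (a sketch matching the printed output above; column names taken from the print):

import numpy as np
import pandas as pd

# Rebuild the test frame with the two NaN trend values shown above.
df = pd.DataFrame({
    'Week': [0, 1, 2, 3, 4],
    'no': [1, 2, 3, 4, 5],
    'Data': [0.0, 0.0, 0.0, 0.0, 10.0],
    'Trend': [np.nan, np.nan, 1.0, 0.5, 0.6],
})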


df = pd.concat([df] * 40, ignore_index=True)

print (df.shape)
(200, 4)

   
In [117]: %%timeit
     ...: df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
     ...: 
119 µs ± 480 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

df = pd.concat([df] * 40, ignore_index=True)

print (df.shape)
(200, 4)


In [121]: %%timeit
     ...: for wk in range(len(df)-1, 0, -1):
     ...:         if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
     ...:                 and not math.isnan(df.loc[wk, 'Trend'])):
     ...:             df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
     ...:                                           *df.loc[wk, 'Trend'])
     ...:                                           
3.12 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

3 Comments

Is there a certain size of dataframe above which this method would be faster? I've just tried this against my method on the example dataframe, running each 100 times; your method took ~15 seconds while mine took ~0.2 (basic timing using the time module, so nothing fancy!). I'm wondering if your method is only faster on larger dataframes? For my purposes, I'll only need to do this on a maximum of 200 rows.
@EmiOB - hmm, not sure what the problem is, but it works well for me - numba is faster than your solution; the answer has been edited.
My bad! I was defining the function within the loop, have fixed now and it is faster. Thanks for your help!
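
For anyone hitting the same slowdown: defining the @jit-decorated function inside the timed loop makes numba recompile it on every call, so compilation time dominates. Compile it once outside the loop (optionally with cache=True to persist the compiled code across runs) and keep only the call inside the timing; a sketch:

import numpy as np
from numba import jit

# Compile once, outside any timing loop; cache=True also writes the
# compiled code to disk so later runs skip compilation entirely.
@jit(nopython=True, cache=True)
def f(a, b):
    for i in range(a.shape[0]-1, 0, -1):
        if (a[i] != 0) and (a[i-1] == 0) and not np.isnan(b[i]):
            a[i-1] = a[i] * b[i]
    return a

# Inside the timing loop, only the (already compiled) call remains:
# df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())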
