
I'm using a pandas DataFrame and applying a trend to the earliest data point to fill in missing historical data as best as possible. I know iterating over a pandas DataFrame is generally discouraged, but I haven't found an alternative way to do this, because each new value depends on the next value. If anyone knows a better way to achieve this, that would be great!

Example df:

   Week no  Data  Trend
0        1   0.0    1.5
1        2   0.0    1.5
2        3   0.0    1.0
3        4   0.0    0.5
4        5  10.0    0.6
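
For reference, the example df can be rebuilt like this (just a sketch of the frame shown above):

import pandas as pd

# Rebuild the example frame shown above.
df = pd.DataFrame({
    'Week no': [1, 2, 3, 4, 5],
    'Data': [0.0, 0.0, 0.0, 0.0, 10.0],
    'Trend': [1.5, 1.5, 1.0, 0.5, 0.6],
})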

The code I am currently using:

import math

# Iterate backwards: when a row has data and the previous row does not,
# back-fill the previous row as data * trend.
for wk in range(len(df)-1, 0, -1):
    if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
            and not math.isnan(df.loc[wk, 'Trend'])):
        df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
                                * df.loc[wk, 'Trend'])

The result:

   Week no  Data  Trend
0        1   4.5    1.5
1        2   3.0    1.5
2        3   3.0    1.0
3        4   6.0    0.5
4        5  10.0    0.6

1 Answer


Recursive calculations like this are not vectorisable; numba is used to improve performance:

import numpy as np
from numba import jit

# Compiled with numba so the backward loop runs at native speed.
@jit(nopython=True)
def f(a, b):
    # Walk backwards: fill a[i-1] from a[i] * b[i] when the previous
    # value is missing (0) and the trend is not NaN.
    for i in range(a.shape[0]-1, 0, -1):
        if (a[i] != 0) and (a[i-1] == 0) and not np.isnan(b[i]):
            a[i-1] = a[i] * b[i]
    return a

df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
print (df)

   Week no  Data  Trend
0        1   4.5    1.5
1        2   3.0    1.5
2        3   3.0    1.0
3        4   6.0    0.5
4        5  10.0    0.6
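
If numba is not available, the same function also works as plain Python over the NumPy arrays - much slower than the jitted version, but still without the .loc overhead of the original loop; a minimal sketch (f_plain is just an illustrative name):

import numpy as np

def f_plain(a, b):
    # Same backward fill as f(), only without the @jit decorator.
    for i in range(a.shape[0]-1, 0, -1):
        if (a[i] != 0) and (a[i-1] == 0) and not np.isnan(b[i]):
            a[i-1] = a[i] * b[i]
    return a

df['Data'] = f_plain(df['Data'].to_numpy(), df['Trend'].to_numpy())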

First, a test with no missing values, like the data in the sample:

df = pd.concat([df] * 40, ignore_index=True)
print (df)
     Week  no  Data  Trend
0       0   1   4.5    1.5
1       1   2   3.0    1.5
2       2   3   3.0    1.0
3       3   4   6.0    0.5
4       4   5  10.0    0.6
..    ...  ..   ...    ...
195     0   1   4.5    1.5
196     1   2   3.0    1.5
197     2   3   3.0    1.0
198     3   4   6.0    0.5
199     4   5  10.0    0.6

[200 rows x 4 columns]

In [114]: %%timeit
     ...: df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
     ...: 
     ...: 
121 µs ± 2.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

df = pd.concat([df] * 40, ignore_index=True)

print (df.shape)
(200, 4)


In [115]: %%timeit
     ...: for wk in range(len(df)-1, 0, -1):
     ...:         if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
     ...:                 and not math.isnan(df.loc[wk, 'Trend'])):
     ...:             df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
     ...:                                           *df.loc[wk, 'Trend'])
     ...:                                           
3.3 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
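
The timings above are from IPython's %%timeit; outside IPython, roughly the same measurement can be taken with the standard timeit module (a sketch, assuming df and f are already defined as above; absolute numbers will vary by machine):

import timeit

# Rough equivalent of %%timeit in a plain script.
t = timeit.timeit(
    "df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())",
    globals=globals(), number=10_000)
print(f'{t / 10_000 * 1e6:.1f} µs per loop')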

Then a test with 2 * 40 missing values; performance is similar:

print (df)
   Week  no  Data  Trend
0     0   1   0.0    NaN
1     1   2   0.0    NaN
2     2   3   0.0    1.0
3     3   4   0.0    0.5
4     4   5  10.0    0.6
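
For completeness, this test frame can be rebuilt like so (a sketch matching the printed output above; column names taken from the print):

import numpy as np
import pandas as pd

# Rebuild the test frame with the two NaN trend values shown above.
df = pd.DataFrame({
    'Week': [0, 1, 2, 3, 4],
    'no': [1, 2, 3, 4, 5],
    'Data': [0.0, 0.0, 0.0, 0.0, 10.0],
    'Trend': [np.nan, np.nan, 1.0, 0.5, 0.6],
})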


df = pd.concat([df] * 40, ignore_index=True)

print (df.shape)
(200, 4)

   
In [117]: %%timeit
     ...: df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
     ...: 
119 µs ± 480 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

df = pd.concat([df] * 40, ignore_index=True)

print (df.shape)
(200, 4)


In [121]: %%timeit
     ...: for wk in range(len(df)-1, 0, -1):
     ...:         if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
     ...:                 and not math.isnan(df.loc[wk, 'Trend'])):
     ...:             df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
     ...:                                           *df.loc[wk, 'Trend'])
     ...:                                           
3.12 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

3 Comments

Is there a certain size of dataframe above which this method would be faster? I've just tried this against my method on the example dataframe, running each 100 times; your method took ~15 seconds while mine took ~0.2 (basic timing using the time module, so nothing fancy!). I'm wondering if your method is only faster on larger dataframes? For my purposes, I'll only need to do this on a maximum of 200 rows.
@EmiOB - hmm, not sure what the problem is, but it works well for me - numba is faster than your solution; the answer has been edited.
My bad! I was defining the function within the loop, have fixed now and it is faster. Thanks for your help!
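
For anyone hitting the same slowdown: defining the @jit-decorated function inside the timed loop makes numba recompile it on every call, so compilation time dominates. Compile it once outside the loop (optionally with cache=True to persist the compiled code across runs) and keep only the call inside the timing; a sketch:

import numpy as np
from numba import jit

# Compile once, outside any timing loop; cache=True also writes the
# compiled code to disk so later runs skip compilation entirely.
@jit(nopython=True, cache=True)
def f(a, b):
    for i in range(a.shape[0]-1, 0, -1):
        if (a[i] != 0) and (a[i-1] == 0) and not np.isnan(b[i]):
            a[i-1] = a[i] * b[i]
    return a

# Inside the timing loop, only the (already compiled) call remains:
# df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())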
