
This is the current dataframe:

> ID        Date    current
> 2001980   10/30/2017  1   
> 2001980   10/29/2017  0   
> 2001980   10/28/2017  0   
> 2001980   10/27/2017  40  
> 2001980   10/26/2017  39  
> 2001980   10/25/2017  0   
> 2001980   10/24/2017  0   
> 2001980   10/23/2017  60  
> 2001980   10/22/2017  0   
> 2001980   10/21/2017  0   
> 2002222   10/21/2017  0   
> 2002222   10/20/2017  0   
> 2002222   10/19/2017  16  
> 2002222   10/18/2017  0   
> 2002222   10/17/2017  0   
> 2002222   10/16/2017  20  
> 2002222   10/15/2017  19  
> 2002222   10/14/2017  18  

Below is the final data frame. The column expected is what I am trying to get.

  1. One ID may have multiple date/record rows; (ID + Date) is unique.
  2. This row's expected value = the previous row's expected value - 1.
  3. The minimum value is 0.
  4. If the value from rule 2 is less than this row's current value, use this row's current value instead. For example, for ID 2001980 on 10/23/2017, rule 2 would give 36, but 36 < 60, so we use 60.

thank you so much.

> ID        Date    current expected 
> 2001980   10/30/2017  1   1 
> 2001980   10/29/2017  0   0
> 2001980   10/28/2017  0   0 
> 2001980   10/27/2017  40  40
> 2001980   10/26/2017  39  39 
> 2001980   10/25/2017  0   38
> 2001980   10/24/2017  0   37 
> 2001980   10/23/2017  60  60
> 2001980   10/22/2017  0   59 
> 2001980   10/21/2017  0   58
> 2002222   10/21/2017  0   0
> 2002222   10/20/2017  0   0 
> 2002222   10/19/2017  16  16
> 2002222   10/18/2017  0   15 
> 2002222   10/17/2017  0   14
> 2002222   10/16/2017  20  20
> 2002222   10/15/2017  19  19
> 2002222   10/14/2017  18  18

I am using Excel with the formula below:

= if(this row's ID = last row's ID, max(last row's expected value - 1, this row's current value), this row's current value)
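For reference, that Excel recurrence translates almost line-for-line into plain Python (a minimal sketch using the sample data from the tables above; note that taking the max with current also enforces the floor of 0 from rule 3, since current is never negative):

```python
import pandas as pd

# Sample data from the question's first table
df = pd.DataFrame({
    "ID": [2001980] * 10 + [2002222] * 8,
    "current": [1, 0, 0, 40, 39, 0, 0, 60, 0, 0,
                0, 0, 16, 0, 0, 20, 19, 18],
})

expected = []
prev_id = None
prev_exp = None
for row_id, cur in zip(df["ID"], df["current"]):
    if row_id == prev_id:
        # same ID as the previous row: count down, but never below this row's current
        exp = max(prev_exp - 1, cur)
    else:
        # first row of a new ID: take current as-is
        exp = cur
    expected.append(exp)
    prev_id, prev_exp = row_id, exp

df["expected"] = expected
```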

  • well, have you tried anything already and gotten stuck somewhere? Because I don't see any code here. Commented Mar 2, 2018 at 17:19
  • Please let us know if you are actually struggling at a particular point. Suggestion: avoid loops when using dataframes for such tasks. Commented Mar 2, 2018 at 18:11
  • I just refined the question; feel free to point out anything that is unclear. Commented Apr 25, 2018 at 18:20
  • Please clarify why you expect zero at 2002222 10/21/2017, shouldn't it be 57? Commented Apr 30, 2018 at 13:22
  • @igrinis sorry for the misunderstanding. That is a new ID: for every new ID, the most recent row can be any number (>=0), and we don't need to touch that value. The same goes for 2001980 on 10/30/2017 with current 1; we don't need to update that value either. Commented Apr 30, 2018 at 21:51

6 Answers


A revised, simpler version:

import numpy as np

df['expected'] = df.groupby(['ID', df.current.ne(0).cumsum()])['current']\
  .transform(lambda x: x.eq(0).cumsum().mul(-1).add(x.iloc[0])).clip(0, np.inf)

Let's have a little fun:

df['expected'] = (df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill()) +
df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1))\
.clip(0,np.inf).fillna(0).astype(int)
print(df)

Output:

         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

Details

Basically, we create a series s1, add the non-positive series s2, then clip negative values and fill NaNs with zero.

# First series: fill the zeros within each 'ID' with the previous non-zero value
s1 = df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill())
s1

Output:

0      1.0
1      1.0
2      1.0
3     40.0
4     39.0
5     39.0
6     39.0
7     60.0
8     60.0
9     60.0
10     NaN
11     NaN
12    16.0
13    16.0
14    16.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

# Second series: the negated cumulative count of zeros within each run, grouped by 'ID'
s2 = df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1)
s2

Output:

0     0
1    -1
2    -2
3     0
4     0
5    -1
6    -2
7     0
8    -1
9    -2
10   -1
11   -2
12    0
13   -1
14   -2
15    0
16    0
17    0
Name: current, dtype: int32

Add the series together, then clip and fill NaNs:

(s1 + s2).clip(0, np.inf).fillna(0)

Output:

0      1.0
1      0.0
2      0.0
3     40.0
4     39.0
5     38.0
6     37.0
7     60.0
8     59.0
9     58.0
10     0.0
11     0.0
12    16.0
13    15.0
14    14.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

1 Comment

@qqqwww run some timings on your million-row dataset; I am curious about the performance.

You can do this using apply and nested functions.

import pandas as pd
ID = [2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2002222,2002222,2002222,2002222,2002222,2002222,2002222,2002222,]
Date = ["10/30/2017","10/29/2017","10/28/2017","10/27/2017","10/26/2017","10/25/2017","10/24/2017","10/23/2017","10/22/2017","10/21/2017","10/21/2017","10/20/2017","10/19/2017","10/18/2017","10/17/2017","10/16/2017","10/15/2017","10/14/2017",]
current = [1 ,0 ,0 ,40,39,0 ,0 ,60,0 ,0 ,0 ,0 ,16,0 ,0 ,20,19,18,]

df = pd.DataFrame({"ID": ID, "Date": Date, "current": current})

Then create the function to update the frame

Python 3.X

def update_frame(df):
    last_expected = None
    def apply_logic(row):
        nonlocal last_expected
        last_row_id = row.name - 1
        if row.name == 0:
            last_expected = row["current"]
            return last_expected
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        last_expected = max(last_expected-1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return last_expected
    return apply_logic

Python 2.X

def update_frame(df):
    sd = {"last_expected": None}
    def apply_logic(row):
        last_row_id = row.name - 1
        if row.name == 0:
            sd['last_expected'] = row["current"]
            return sd['last_expected']
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        sd['last_expected'] = max(sd['last_expected'] - 1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return sd['last_expected']
    return apply_logic

Then run the function like below:

df['expected'] = df.apply(update_frame(df), axis=1)

The output matches the expected column shown in the question.

7 Comments

hi Tarun, I am using Python 2, and it looks like nonlocal was introduced in Python 3. Also, could I know what row.name is here?
nonlocal can be easily ported to Python 2.7, and row.name gives you the index of that row. Let me update the solution with Python 2.7 code.
@qqqwww, please check the updated answer with python 2.7 code
thank you. After validating against my main df, all the values matched my expected values. 1. Is the row in def apply_logic(row) actually each row of the df? 2. Why do you check row.name == 0? Is this because last_row_id = -1 there, so you want to make sure the rest of the code begins from row[1] and there is always a previous row? 3. I think this part is the key: last_row = df.iloc[[last_row_id]].iloc[0].to_dict(); could you please explain it? 4. Why do we need the function update_frame? Why not use the function apply_logic directly?
1. Yes, apply will send each row when you have set axis=1. 2. The row.name == 0 check returns the current value for the first row, because there is no previous row in that case. 3. last_row = df.iloc[[last_row_id]].iloc[0].to_dict() accesses the previous row of df by index and gets its values as a column-name-to-value dictionary. 4. We need update_frame as a wrapper so the nested function can use a shared variable to store the expected value from the last result. A single function would need a global variable outside the function (bad design).
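The shared-state pattern that comment describes can be seen in isolation with a toy closure (hypothetical names, not part of the answer): the outer function holds a mutable container that the inner function reads and writes across calls, which works in Python 2 as well because the container itself is never rebound.

```python
def make_counter():
    state = {"count": 0}  # mutable container shared with the inner function

    def bump():
        # mutate the dict in place; no nonlocal needed, so this runs on Python 2 too
        state["count"] += 1
        return state["count"]

    return bump

counter = make_counter()
print(counter(), counter(), counter())  # prints: 1 2 3
```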

You can use a conditional statement combined with .shift() to get the previous row, and np.where, which AFAIK does not rely on loops (mentioned in a comment as something to avoid):

import numpy as np

df['test'] = np.where(df['current'].shift() < df['current'],
                      df['current'] - 1, df['current'])

Result (I added a 'test' column with the result; you can change it to 'expected' if you so desire):

>>> df
         ID        Date  current  expected  test
0   2001980  10/30/2017        1         1     1
1   2001980  10/29/2017        0         0     0
2   2001980  10/28/2017        0         0     0
3   2001980  10/27/2017       40        40    39
4   2001980  10/26/2017       39        39    39
5   2001980  10/25/2017       38        38    38
6   2001980  10/24/2017       37        37    37
7   2001980  10/18/2017        0        36     0
8   2001980  10/17/2017        0        35     0
9   2001980  10/16/2017       60        60    59
10  2001980  10/15/2017        0        59     0
11  2001980  10/14/2017        0        58     0
12  2001980  10/13/2017        0        57     0
13  2001980  10/12/2017        0        56     0
14  2002222  10/21/2017        0         0     0
15  2002222  10/20/2017        0         0     0
16  2002222  10/19/2017       16        16    15
17  2002222  10/18/2017        0        15     0
18  2002222  10/17/2017        0        14     0
19  2002222  10/16/2017       13        13    12
20  2002222  10/15/2017       12        12    12
21  2002222  10/14/2017       11        11    11
22  2002222  10/13/2017       10        10    10
23  2002222  10/12/2017        9         9     9

5 Comments

Hi Bernie, the column expected is my expected final result; I just created that column manually. What I want is to use Python to create this column. My apologies for the misunderstanding.
@qqqwww: yep, just change 'test' -> 'expected' and you're good to go!
the value in row 3 should be 40. the value in row 16 should be 16, not 15. the value in row 19 should be 13, not 12.
Then I believe your logic is faulty. I am pretty sure I have implemented what you described. Please revisit your logic and get back to us.
Well, you IMO have completely changed the question with those additional requirements.

EDIT: To address OP's concern about scaling up to millions of rows.

Yes, my original answer will not scale to very large dataframes. However, with minor edits, this easy-to-read solution will scale. The optimizations that follow take advantage of the JIT compiler in Numba. After importing Numba, I added the jit decorator and modified the function to operate on numpy arrays instead of pandas objects; Numba is numpy-aware and cannot optimize pandas objects.

import numba
import numpy as np

@numba.jit
def expected(id_col, current_col):
    lexp = []
    lstID = 0
    expected = 0
    for i in range(len(id_col)):
        id, current = id_col[i], current_col[i]
        if id == lstID:
            expected = max(current, max(expected - 1, 0))
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return np.array(lexp)

To pass a numpy array to the function, use the .values attribute of the pandas series.

df1['expected'] = expected(df1.ID.values, df1.current.values)

To test the performance, I scaled up your original dataframe to more than 1 million rows.

df1 = df
while len(df1) < 1000000:
    df1 = pd.concat([df1, df1])
df1.reset_index(inplace=True, drop=True)

The new changes perform very well.

%timeit expected(df1.ID.values, df1.current.values)
44.9 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

df1.shape
Out[65]: (1179648, 4)

df1.tail(15)
Out[66]: 
              ID        Date  current  expected
1179633  2001980  10/27/2017       40        40
1179634  2001980  10/26/2017       39        39
1179635  2001980  10/25/2017        0        38
1179636  2001980  10/24/2017        0        37
1179637  2001980  10/23/2017       60        60
1179638  2001980  10/22/2017        0        59
1179639  2001980  10/21/2017        0        58
1179640  2002222  10/21/2017        0         0
1179641  2002222  10/20/2017        0         0
1179642  2002222  10/19/2017       16        16
1179643  2002222  10/18/2017        0        15
1179644  2002222  10/17/2017        0        14
1179645  2002222  10/16/2017       20        20
1179646  2002222  10/15/2017       19        19
1179647  2002222  10/14/2017       18        18

ORIGINAL ANSWER

A little brute force but really easy to follow.

def expected(df):
    lexp = []
    lstID = None
    expected = 0
    for i in range(len(df)):
        id, current = df[['ID', 'current']].iloc[i]
        if id == lstID:
            expected = max(expected - 1, 0)
            expected = max(current, expected)
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return pd.Series(lexp)

Output

df['expected'] = expected(df)

df
Out[53]: 
         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

1 Comment

my df has multi-million rows; I think using for i in range(len(df)) might be very slow. Do you think so?

I believe @Tarun Lalwani has pointed you in the right direction: save some critical information outside the DataFrame. The code can be simplified, though, and there is nothing wrong with using global variables as long as you manage names properly; it's one of the design patterns that can often make things simpler and improve readability.

cached_last = { 'expected': None, 'ID': None }

def set_expected(x):
    if cached_last['ID'] is None or x.ID != cached_last['ID']:
        expected = x.current
    else:
        expected = max(cached_last['expected'] - 1, x.current)
    cached_last['ID'] = x.ID
    cached_last['expected'] = expected
    return expected

df['expected'] = df.apply(set_expected, axis=1)

From the documentation on pandas.DataFrame.apply, do be careful about the potential side-effects of the apply function.

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
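If that double-call behavior is a concern, one side-effect-free way to express the same recurrence is a plain loop over itertuples, which visits each row exactly once (a sketch using the question's column names on a small made-up sample):

```python
import pandas as pd

# Small sample in the same shape as the question's data
df = pd.DataFrame({
    "ID": [2001980, 2001980, 2001980, 2002222, 2002222],
    "current": [40, 0, 0, 16, 0],
})

expected = []
last_id, last_exp = None, None
for row in df.itertuples(index=False):
    if row.ID == last_id:
        # same ID: count down from the previous expected, floored at current
        last_exp = max(last_exp - 1, row.current)
    else:
        # first row of a new ID: take current as-is
        last_exp = row.current
    expected.append(last_exp)
    last_id = row.ID

df["expected"] = expected
```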

2 Comments

how did you set up the value for cached_last? Looks like it missed one step before you do if cached_last['ID'] is None or x.ID != cached_last['ID']:
On top: cached_last is defined outside the function set_expected cached_last = { 'expected': None, 'ID': None }. In the first processing row, you will get None, and after that the ID and expected will be set at the end of each call so when processing the next row they return the values from the last row. I think this is a very common code practice. right?

The logic here should work:

lst = []

for _, y in df.groupby('ID'):
    z = []
    for i, (_, x) in enumerate(y.iterrows()):
        if x['current'] > 0:
            z.append(x['current'])
        else:
            # previous value within this group minus 1, floored at 0
            z.append(max(z[i-1] - 1, 0) if z else 0)
    lst.extend(z)

lst

Out[484]: [1, 0, 0, 40, 39, 38, 37, 60, 59, 58, 0, 0, 16, 15, 14, 20, 19, 18]
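One caveat when writing lst back: df.groupby('ID') sorts groups by key by default, so lst is ordered by ID rather than by original row position; it only lines up with df here because the sample's IDs already appear sorted and contiguous. A sketch of a write-back that is safe regardless of row order, using each group's own index to place the results:

```python
import pandas as pd

# Deliberately unsorted sample to show the alignment issue
df = pd.DataFrame({
    "ID": [2002222, 2002222, 2001980, 2001980],
    "current": [16, 0, 40, 0],
})

out = pd.Series(index=df.index, dtype=float)
for _, y in df.groupby('ID', sort=False):
    z = []
    for _, x in y.iterrows():
        if x['current'] > 0:
            z.append(x['current'])
        else:
            z.append(max(z[-1] - 1, 0) if z else 0)
    # place this group's results at its original row positions
    out[y.index] = z

df['expected'] = out.astype(int)
```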

Comments
