
This is the current dataframe:

> ID        Date    current
> 2001980   10/30/2017  1   
> 2001980   10/29/2017  0   
> 2001980   10/28/2017  0   
> 2001980   10/27/2017  40  
> 2001980   10/26/2017  39  
> 2001980   10/25/2017  0   
> 2001980   10/24/2017  0   
> 2001980   10/23/2017  60  
> 2001980   10/22/2017  0   
> 2001980   10/21/2017  0   
> 2002222   10/21/2017  0   
> 2002222   10/20/2017  0   
> 2002222   10/19/2017  16  
> 2002222   10/18/2017  0   
> 2002222   10/17/2017  0   
> 2002222   10/16/2017  20  
> 2002222   10/15/2017  19  
> 2002222   10/14/2017  18  

Below is the final data frame. The column expected is what I am trying to get.

  1. One ID may have multiple date/record rows; (ID + Date) is unique.
  2. This row's expected value = the previous row's expected value - 1.
  3. The minimum value is 0.
  4. If the value from rule 2 is less than this row's current value, use this row's current value instead. For example, for ID 2001980 on 10/23/2017, rule 2 would give 36, but 36 < 60, so we use 60.

thank you so much.

> ID        Date    current expected 
> 2001980   10/30/2017  1   1 
> 2001980   10/29/2017  0   0
> 2001980   10/28/2017  0   0 
> 2001980   10/27/2017  40  40
> 2001980   10/26/2017  39  39 
> 2001980   10/25/2017  0   38
> 2001980   10/24/2017  0   37 
> 2001980   10/23/2017  60  60
> 2001980   10/22/2017  0   59 
> 2001980   10/21/2017  0   58
> 2002222   10/21/2017  0   0
> 2002222   10/20/2017  0   0 
> 2002222   10/19/2017  16  16
> 2002222   10/18/2017  0   15 
> 2002222   10/17/2017  0   14
> 2002222   10/16/2017  20  20
> 2002222   10/15/2017  19  19
> 2002222   10/14/2017  18  18

I am using Excel with the formula below:

= if(this row's ID = last row's ID, max(last row's expected value - 1, this row's current value), this row's current value)
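For reference, that Excel recurrence translates almost line-for-line into plain Python (a minimal sketch using the sample data from the tables above; note that taking the max with current also enforces the floor of 0 from rule 3, since current is never negative):

```python
import pandas as pd

# Sample data from the question's first table
df = pd.DataFrame({
    "ID": [2001980] * 10 + [2002222] * 8,
    "current": [1, 0, 0, 40, 39, 0, 0, 60, 0, 0,
                0, 0, 16, 0, 0, 20, 19, 18],
})

expected = []
prev_id = None
prev_exp = None
for row_id, cur in zip(df["ID"], df["current"]):
    if row_id == prev_id:
        # same ID as the previous row: count down, but never below this row's current
        exp = max(prev_exp - 1, cur)
    else:
        # first row of a new ID: take current as-is
        exp = cur
    expected.append(exp)
    prev_id, prev_exp = row_id, exp

df["expected"] = expected
```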

  • well, have you tried anything already and gotten stuck somewhere? Because I don't see any code here. Commented Mar 2, 2018 at 17:19
  • Please let us know if you are actually struggling at a particular point. Suggestion: avoid loops when using dataframes for such tasks. Commented Mar 2, 2018 at 18:11
  • I just refined the question; feel free to point out anything that is unclear. Commented Apr 25, 2018 at 18:20
  • Please clarify why you expect zero at 2002222 10/21/2017, shouldn't it be 57? Commented Apr 30, 2018 at 13:22
  • @igrinis sorry for the misunderstanding. That is a new ID: for every new ID, the most recent row can be any number (>=0), and we don't need to touch that value. The same goes for 2001980 on 10/30/2017 with current 1; we don't need to update that value either. Commented Apr 30, 2018 at 21:51

6 Answers


A revised, simpler version:

import numpy as np

df['expected'] = df.groupby(['ID', df.current.ne(0).cumsum()])['current']\
  .transform(lambda x: x.eq(0).cumsum().mul(-1).add(x.iloc[0])).clip(0, np.inf)

Let's have a little fun:

df['expected'] = (df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill()) +
df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1))\
.clip(0,np.inf).fillna(0).astype(int)
print(df)

Output:

         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

Details

Basically, we create a series s1, add the non-positive series s2, then clip negative values and fill NaNs with zero.

# First series: fill the zeros within each 'ID' with the previous non-zero value
s1 = df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill())
s1

Output:

0      1.0
1      1.0
2      1.0
3     40.0
4     39.0
5     39.0
6     39.0
7     60.0
8     60.0
9     60.0
10     NaN
11     NaN
12    16.0
13    16.0
14    16.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

# Second series: the negated cumulative count of zeros within each run, grouped by 'ID'
s2 = df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1)
s2

Output:

0     0
1    -1
2    -2
3     0
4     0
5    -1
6    -2
7     0
8    -1
9    -2
10   -1
11   -2
12    0
13   -1
14   -2
15    0
16    0
17    0
Name: current, dtype: int32

Add the series together, then clip and fill NaNs:

(s1 + s2).clip(0, np.inf).fillna(0)

Output:

0      1.0
1      0.0
2      0.0
3     40.0
4     39.0
5     38.0
6     37.0
7     60.0
8     59.0
9     58.0
10     0.0
11     0.0
12    16.0
13    15.0
14    14.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

1 Comment

@qqqwww run some timings on your million-row dataset; I am curious about the performance.

You can do this using apply and nested functions.

import pandas as pd
ID = [2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2002222,2002222,2002222,2002222,2002222,2002222,2002222,2002222,]
Date = ["10/30/2017","10/29/2017","10/28/2017","10/27/2017","10/26/2017","10/25/2017","10/24/2017","10/23/2017","10/22/2017","10/21/2017","10/21/2017","10/20/2017","10/19/2017","10/18/2017","10/17/2017","10/16/2017","10/15/2017","10/14/2017",]
current = [1 ,0 ,0 ,40,39,0 ,0 ,60,0 ,0 ,0 ,0 ,16,0 ,0 ,20,19,18,]

df = pd.DataFrame({"ID": ID, "Date": Date, "current": current})

Then create the function to update the frame

Python 3.X

def update_frame(df):
    last_expected = None
    def apply_logic(row):
        nonlocal last_expected
        last_row_id = row.name - 1
        if row.name == 0:
            last_expected = row["current"]
            return last_expected
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        last_expected = max(last_expected-1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return last_expected
    return apply_logic

Python 2.X

def update_frame(df):
    sd = {"last_expected": None}
    def apply_logic(row):
        last_row_id = row.name - 1
        if row.name == 0:
            sd['last_expected'] = row["current"]
            return sd['last_expected']
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        sd['last_expected'] = max(sd['last_expected'] - 1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return sd['last_expected']
    return apply_logic

Then run the function like below:

df['expected'] = df.apply(update_frame(df), axis=1)

The output matches the expected column shown in the question.

7 Comments

hi Tarun, I am using Python 2, and it looks like nonlocal was introduced in Python 3. Also, could I know what row.name is here?
nonlocal can be easily ported to Python 2.7, and row.name gives you the index of that row. Let me update the solution with Python 2.7 code.
@qqqwww, please check the updated answer with python 2.7 code
thank you. After validating against my main df, all the values matched my expected values. 1. Is the row in def apply_logic(row) actually each row of the df? 2. Why do you check row.name == 0? Is this because last_row_id = -1 there, so you want to make sure the rest of the code begins from row[1] and there is always a previous row? 3. I think this part is the key: last_row = df.iloc[[last_row_id]].iloc[0].to_dict(); could you please explain it? 4. Why do we need the function update_frame? Why not use the function apply_logic directly?
1. Yes, apply will send each row when you have set axis=1. 2. The row.name == 0 check returns the current value for the first row, because there is no previous row in that case. 3. last_row = df.iloc[[last_row_id]].iloc[0].to_dict() accesses the previous row of df by index and gets its values as a column-name-to-value dictionary. 4. We need update_frame as a wrapper so the nested function can use a shared variable to store the expected value from the last result. A single function would need a global variable outside the function (bad design).
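The shared-state pattern that comment describes can be seen in isolation with a toy closure (hypothetical names, not part of the answer): the outer function holds a mutable container that the inner function reads and writes across calls, which works in Python 2 as well because the container itself is never rebound.

```python
def make_counter():
    state = {"count": 0}  # mutable container shared with the inner function

    def bump():
        # mutate the dict in place; no nonlocal needed, so this runs on Python 2 too
        state["count"] += 1
        return state["count"]

    return bump

counter = make_counter()
print(counter(), counter(), counter())  # prints: 1 2 3
```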

You can use a conditional statement combined with .shift() to get the previous row, and np.where, which AFAIK does not rely on loops (mentioned in a comment as something to avoid):

import numpy as np

df['test'] = np.where(df['current'].shift() < df['current'],
                      df['current'] - 1, df['current'])

Result (I added a 'test' column with the result; you can change it to 'expected' if you so desire):

>>> df
         ID        Date  current  expected  test
0   2001980  10/30/2017        1         1     1
1   2001980  10/29/2017        0         0     0
2   2001980  10/28/2017        0         0     0
3   2001980  10/27/2017       40        40    39
4   2001980  10/26/2017       39        39    39
5   2001980  10/25/2017       38        38    38
6   2001980  10/24/2017       37        37    37
7   2001980  10/18/2017        0        36     0
8   2001980  10/17/2017        0        35     0
9   2001980  10/16/2017       60        60    59
10  2001980  10/15/2017        0        59     0
11  2001980  10/14/2017        0        58     0
12  2001980  10/13/2017        0        57     0
13  2001980  10/12/2017        0        56     0
14  2002222  10/21/2017        0         0     0
15  2002222  10/20/2017        0         0     0
16  2002222  10/19/2017       16        16    15
17  2002222  10/18/2017        0        15     0
18  2002222  10/17/2017        0        14     0
19  2002222  10/16/2017       13        13    12
20  2002222  10/15/2017       12        12    12
21  2002222  10/14/2017       11        11    11
22  2002222  10/13/2017       10        10    10
23  2002222  10/12/2017        9         9     9

5 Comments

Hi Bernie, the column expected is my expected final result; I just created that column manually. What I want is to use Python to create this column. My apologies for the misunderstanding.
@qqqwww: yep, just change 'test' -> 'expected' and you're good to go!
the value in row 3 should be 40. the value in row 16 should be 16, not 15. the value in row 19 should be 13, not 12.
Then I believe your logic is faulty. I am pretty sure I have implemented what you described. Please revisit your logic and get back to us.
Well, you IMO have completely changed the question with those additional requirements.

EDIT: To address OP's concern about scaling up to millions of rows.

Yes, my original answer will not scale to very large dataframes. However, with minor edits, this easy-to-read solution will scale. The optimizations that follow take advantage of the JIT compiler in Numba. After importing Numba, I added the jit decorator and modified the function to operate on numpy arrays instead of pandas objects; Numba is numpy-aware and cannot optimize pandas objects.

import numba
import numpy as np

@numba.jit
def expected(id_col, current_col):
    lexp = []
    lstID = 0
    expected = 0
    for i in range(len(id_col)):
        id, current = id_col[i], current_col[i]
        if id == lstID:
            expected = max(current, max(expected - 1, 0))
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return np.array(lexp)

To pass a numpy array to the function, use the .values attribute of the pandas series.

df1['expected'] = expected(df1.ID.values, df1.current.values)

To test the performance, I scaled up your original dataframe to more than 1 million rows.

df1 = df
while len(df1) < 1000000:
    df1 = pd.concat([df1, df1])
df1.reset_index(inplace=True, drop=True)

The new changes perform very well.

%timeit expected(df1.ID.values, df1.current.values)
44.9 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

df1.shape
Out[65]: (1179648, 4)

df1.tail(15)
Out[66]: 
              ID        Date  current  expected
1179633  2001980  10/27/2017       40        40
1179634  2001980  10/26/2017       39        39
1179635  2001980  10/25/2017        0        38
1179636  2001980  10/24/2017        0        37
1179637  2001980  10/23/2017       60        60
1179638  2001980  10/22/2017        0        59
1179639  2001980  10/21/2017        0        58
1179640  2002222  10/21/2017        0         0
1179641  2002222  10/20/2017        0         0
1179642  2002222  10/19/2017       16        16
1179643  2002222  10/18/2017        0        15
1179644  2002222  10/17/2017        0        14
1179645  2002222  10/16/2017       20        20
1179646  2002222  10/15/2017       19        19
1179647  2002222  10/14/2017       18        18

ORIGINAL ANSWER

A little brute force but really easy to follow.

def expected(df):
    lexp = []
    lstID = None
    expected = 0
    for i in range(len(df)):
        id, current = df[['ID', 'current']].iloc[i]
        if id == lstID:
            expected = max(expected - 1, 0)
            expected = max(current, expected)
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return pd.Series(lexp)

Output

df['expected'] = expected(df)

df
Out[53]: 
         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

1 Comment

my df has multi-million rows; I think using for i in range(len(df)) might be very slow. Do you think so?

I believe @Tarun Lalwani has pointed you in the right direction: save some critical information outside the DataFrame. The code can be simplified, though, and there is nothing wrong with using global variables as long as you manage names properly; it's one of the design patterns that can often make things simpler and improve readability.

cached_last = { 'expected': None, 'ID': None }

def set_expected(x):
    if cached_last['ID'] is None or x.ID != cached_last['ID']:
        expected = x.current
    else:
        expected = max(cached_last['expected'] - 1, x.current)
    cached_last['ID'] = x.ID
    cached_last['expected'] = expected
    return expected

df['expected'] = df.apply(set_expected, axis=1)

From the documentation on pandas.DataFrame.apply, do be careful about the potential side-effects of the apply function.

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
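If that double-call behavior is a concern, one side-effect-free way to express the same recurrence is a plain loop over itertuples, which visits each row exactly once (a sketch using the question's column names on a small made-up sample):

```python
import pandas as pd

# Small sample in the same shape as the question's data
df = pd.DataFrame({
    "ID": [2001980, 2001980, 2001980, 2002222, 2002222],
    "current": [40, 0, 0, 16, 0],
})

expected = []
last_id, last_exp = None, None
for row in df.itertuples(index=False):
    if row.ID == last_id:
        # same ID: count down from the previous expected, floored at current
        last_exp = max(last_exp - 1, row.current)
    else:
        # first row of a new ID: take current as-is
        last_exp = row.current
    expected.append(last_exp)
    last_id = row.ID

df["expected"] = expected
```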

2 Comments

how did you set up the value for cached_last? Looks like it missed one step before you do if cached_last['ID'] is None or x.ID != cached_last['ID']:
On top: cached_last is defined outside the function set_expected cached_last = { 'expected': None, 'ID': None }. In the first processing row, you will get None, and after that the ID and expected will be set at the end of each call so when processing the next row they return the values from the last row. I think this is a very common code practice. right?

The logic here should work:

lst = []

for _, y in df.groupby('ID'):
    z = []
    for i, (_, x) in enumerate(y.iterrows()):
        if x['current'] > 0:
            z.append(x['current'])
        else:
            # previous value within this group minus 1, floored at 0
            z.append(max(z[i-1] - 1, 0) if z else 0)
    lst.extend(z)

lst

Out[484]: [1, 0, 0, 40, 39, 38, 37, 60, 59, 58, 0, 0, 16, 15, 14, 20, 19, 18]
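One caveat when writing lst back: df.groupby('ID') sorts groups by key by default, so lst is ordered by ID rather than by original row position; it only lines up with df here because the sample's IDs already appear sorted and contiguous. A sketch of a write-back that is safe regardless of row order, using each group's own index to place the results:

```python
import pandas as pd

# Deliberately unsorted sample to show the alignment issue
df = pd.DataFrame({
    "ID": [2002222, 2002222, 2001980, 2001980],
    "current": [16, 0, 40, 0],
})

out = pd.Series(index=df.index, dtype=float)
for _, y in df.groupby('ID', sort=False):
    z = []
    for _, x in y.iterrows():
        if x['current'] > 0:
            z.append(x['current'])
        else:
            z.append(max(z[-1] - 1, 0) if z else 0)
    # place this group's results at its original row positions
    out[y.index] = z

df['expected'] = out.astype(int)
```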

Comments
