
I have the following dataframe in Pandas. The Score and Date_of_interest columns are to be calculated; below they are already filled in to make the explanation of the problem easier.

First, assume that the Score and Date_of_interest columns contain only NaNs. The steps to fill in their values are:

a) We are trying to get one date of interest per PC_id, based on the criteria described below; e.g. PC_id 200 gets 1998-04-10 02:25:00, and so on.

b) To solve this, we take the PC_id column and check each row for a change in Item_id; each change scores 1. For the same Item_id, as in the first and second rows (both 1), the score starts at 1 but does not change in the second row.

c) While calculating the score for the second row, we also check the Datetime difference: if the previous row is more than 24 hours old, it is dropped, the score is reset to 1, and the cursor moves to the third row.

d) When the Score reaches 2, we have reached the qualifying score, as in row 5 (index 4), and we copy the corresponding Datetime into the Date_of_interest column.

e) We start a new cycle for each new PC_id, as in row 6 (index 5).

    Datetime             Item_id  PC_id  Value  Score  Date_of_interest

0   1998-04-08 01:00:00  1        200    35     1      NaN
1   1998-04-08 02:00:00  1        200    92     1      NaN
2   1998-04-10 02:00:00  2        200    35     1      NaN
3   1998-04-10 02:15:00  2        200    92     1      NaN
4   1998-04-10 02:25:00  3        200    92     2      1998-04-10 02:25:00

5   1998-04-10 03:00:00  1        201    93     1      NaN
6   1998-04-12 03:30:00  3        201    94     1      NaN
7   1998-04-12 04:00:00  4        201    95     2      NaN
8   1998-04-12 04:00:00  4        201    26     2      1998-04-12 04:00:00
9   1998-04-12 04:30:00  2        201    98     3      NaN

10  1998-04-12 04:50:00  1        202    100    1      NaN
11  1998-04-15 05:00:00  4        202    100    1      NaN
12  1998-04-15 05:15:00  3        202    100    2      1998-04-15 05:15:00
13  1998-04-15 05:30:00  2        202    100    3      NaN
14  1998-04-15 06:00:00  3        202    100    NaN    NaN
15  1998-04-15 06:00:00  3        202    222    NaN    NaN

The final table should be as follows:

    PC_id      Date_of_interest  

0   200       1998-04-10 02:25:00
1   201       1998-04-12 04:00:00
2   202       1998-04-15 05:15:00
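For anyone who wants to reproduce the example, the input frame (minus the two columns to be computed) can be rebuilt from the table above; `df2` is just an illustrative name:

```python
import pandas as pd

# Reconstruction of the sample input shown above, without the two
# computed columns (Score and Date_of_interest).
df2 = pd.DataFrame({
    'Datetime': pd.to_datetime([
        '1998-04-08 01:00:00', '1998-04-08 02:00:00', '1998-04-10 02:00:00',
        '1998-04-10 02:15:00', '1998-04-10 02:25:00', '1998-04-10 03:00:00',
        '1998-04-12 03:30:00', '1998-04-12 04:00:00', '1998-04-12 04:00:00',
        '1998-04-12 04:30:00', '1998-04-12 04:50:00', '1998-04-15 05:00:00',
        '1998-04-15 05:15:00', '1998-04-15 05:30:00', '1998-04-15 06:00:00',
        '1998-04-15 06:00:00']),
    'Item_id': [1, 1, 2, 2, 3, 1, 3, 4, 4, 2, 1, 4, 3, 2, 3, 3],
    'PC_id': [200] * 5 + [201] * 5 + [202] * 6,
    'Value': [35, 92, 35, 92, 92, 93, 94, 95, 26, 98,
              100, 100, 100, 100, 100, 222],
})
```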

Thanks for helping.

Update: code I am currently working on:

import pandas as pd

# State per PC_id: current score, last Datetime and Item_id seen, and
# whether the qualifying score has already been reached.
state = {}

for i, row in df_merged.iterrows():
    pc = row['PC_id']
    s = state.get(pc)
    # New PC_id, or previous row more than 24 hours old: reset to 1.
    # (pd.Timedelta, not pd.datetime.timedelta, which does not exist.)
    if s is None or row['Datetime'] - s['date'] > pd.Timedelta(days=1):
        state[pc] = {'score': 1, 'date': row['Datetime'],
                     'item': row['Item_id'], 'done': False}
        df_merged.at[i, 'Score'] = 1
        continue
    if row['Item_id'] != s['item']:
        s['score'] += 1
    s['item'] = row['Item_id']
    s['date'] = row['Datetime']
    # Write via .at so the change persists; assigning to the iterrows()
    # row only modifies a copy.
    df_merged.at[i, 'Score'] = s['score']
    if s['score'] == 2 and not s['done']:
        df_merged.at[i, 'Date_of_interest'] = row['Datetime']
        s['done'] = True
  • So what's the question and do you have any actual code? Commented Aug 28, 2016 at 22:07
  • 1st table is my working table. 2nd table is the output I want. Commented Aug 28, 2016 at 22:10
  • Do you have some code to perform the transformation and a specific problem with said code? For example, do you need help in how to drop the rows in the 1st dataframe that have NaN as Date_of_interest? Commented Aug 28, 2016 at 22:12
  • @Ilja Everila: I am trying to calculate Score and Date_of_interest and aggregate the second table (optional) as the final output. I have updated my code. Thanks. Commented Aug 28, 2016 at 22:32

1 Answer


Usually having to resort to iterative/imperative methods is a sign of trouble when working with pandas. Given the dataframe

In [111]: df2
Out[111]: 
              Datetime  Item_id  PC_id  Value
0  1998-04-08 01:00:00        1    200     35
1  1998-04-08 02:00:00        1    200     92
2  1998-04-10 02:00:00        2    200     35
3  1998-04-10 02:15:00        2    200     92
4  1998-04-10 02:25:00        3    200     92
5  1998-04-10 03:00:00        1    201     93
6  1998-04-12 03:30:00        3    201     94
7  1998-04-12 04:00:00        4    201     95
8  1998-04-12 04:00:00        4    201     26
9  1998-04-12 04:30:00        2    201     98
10 1998-04-12 04:50:00        1    202    100
11 1998-04-15 05:00:00        4    202    100
12 1998-04-15 05:15:00        3    202    100
13 1998-04-15 05:30:00        2    202    100
14 1998-04-15 06:00:00        3    202    100
15 1998-04-15 06:00:00        3    202    222

you could first group by PC_id

In [112]: the_group = df2.groupby('PC_id')

and then apply the search, using diff() to find the rows where Item_id and Datetime change appropriately

In [356]: from datetime import timedelta

In [357]: (the_group['Item_id'].diff() != 0) & \
     ...: (the_group['Datetime'].diff() <= timedelta(days=1))
Out[357]: 
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7      True
8     False
9      True
10    False
11    False
12     True
13     True
14     True
15    False
dtype: bool
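As an aside on why the first row of each group is always False: `diff()` yields `NaT` for the first row of a group, and a `NaT` compared against a `timedelta` evaluates to False, so a group's first row can never qualify on its own. A minimal sketch of that behaviour:

```python
import pandas as pd
from datetime import timedelta

# Two timestamps one hour apart.
s = pd.to_datetime(pd.Series(['1998-04-08 01:00:00',
                              '1998-04-08 02:00:00']))
d = s.diff()                   # first element is NaT
mask = d <= timedelta(days=1)  # comparing NaT yields False
print(mask.tolist())           # [False, True]
```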

and then just take the first date (first match) in each group, if any

In [341]: df2[(the_group['Item_id'].diff() != 0) &
     ...:     (the_group['Datetime'].diff() <= timedelta(days=1))]\
     ...: .groupby('PC_id').first()['Datetime'].reset_index()
Out[341]: 
   PC_id            Datetime
0    200 1998-04-10 02:25:00
1    201 1998-04-12 04:00:00
2    202 1998-04-15 05:15:00

3 Comments

  • @Ilja Everila: Wow! Very short code and quick. But I get the following error on my main dataframe: IndexError: single positional indexer is out-of-bounds. Any ideas?
  • Ah, of course, the edge cases; that's the result of a group having 0 matches and .iloc[0]. Will update the answer to take that into account.
  • @Ilja Everila: Worked like a charm. You saved me a lot of time. Thank you.
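To illustrate the edge case discussed in these comments: with the groupby/`first()` approach, a PC_id that never reaches the qualifying score simply drops out of the result, instead of raising the IndexError that `.iloc[0]` on an empty group would. A small sketch with a hypothetical frame (PC_id 300 is an assumption, not from the original data):

```python
import pandas as pd
from datetime import timedelta

# PC_id 300 has a single row, so both diffs are NaT/NaN on its first
# (and only) row and it never matches.
df = pd.DataFrame({
    'PC_id': [200, 200, 300],
    'Datetime': pd.to_datetime(['1998-04-10 02:00:00',
                                '1998-04-10 02:25:00',
                                '1998-04-12 03:00:00']),
    'Item_id': [2, 3, 1],
})
g = df.groupby('PC_id')
mask = (g['Item_id'].diff() != 0) & (g['Datetime'].diff() <= timedelta(days=1))
result = df[mask].groupby('PC_id').first()['Datetime'].reset_index()
print(result)  # PC_id 300 is simply absent instead of raising
```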
