
I have the following dataframe in Pandas. The Score and Date_of_interest columns are to be calculated; below they are already filled in to make the explanation of the problem easier.

First, assume that the Score and Date_of_interest columns contain only NaNs. The steps to fill in their values are:

a) We are trying to get one date of interest per PC_id, based on the criteria described below; e.g. PC_id 200 gets 1998-04-10 02:25:00, and so on.

b) To solve this, we take the PC_id column and check each row for a change in Item_id; each change scores 1. For the same Item_id, as in the first and second rows (both 1), the score starts at 1 but does not change in the second row.

c) While calculating the score for the second row, we also check the Datetime difference: if the previous row is more than 24 hours old, it is dropped, the score is reset to 1, and the cursor moves to the third row.

d) When the Score reaches 2, we have reached the qualifying score, as in row 5 (index 4), and we copy the corresponding Datetime into the Date_of_interest column.

e) We start a new cycle for each new PC_id, as in row 6 (index 5).

    Datetime             Item_id  PC_id  Value  Score  Date_of_interest

0   1998-04-08 01:00:00  1        200    35     1      NaN
1   1998-04-08 02:00:00  1        200    92     1      NaN
2   1998-04-10 02:00:00  2        200    35     1      NaN
3   1998-04-10 02:15:00  2        200    92     1      NaN
4   1998-04-10 02:25:00  3        200    92     2      1998-04-10 02:25:00

5   1998-04-10 03:00:00  1        201    93     1      NaN
6   1998-04-12 03:30:00  3        201    94     1      NaN
7   1998-04-12 04:00:00  4        201    95     2      NaN
8   1998-04-12 04:00:00  4        201    26     2      1998-04-12 04:00:00
9   1998-04-12 04:30:00  2        201    98     3      NaN

10  1998-04-12 04:50:00  1        202    100    1      NaN
11  1998-04-15 05:00:00  4        202    100    1      NaN
12  1998-04-15 05:15:00  3        202    100    2      1998-04-15 05:15:00
13  1998-04-15 05:30:00  2        202    100    3      NaN
14  1998-04-15 06:00:00  3        202    100    NaN    NaN
15  1998-04-15 06:00:00  3        202    222    NaN    NaN

The final table should be as follows:

    PC_id      Date_of_interest  

0   200       1998-04-10 02:25:00
1   201       1998-04-12 04:00:00
2   202       1998-04-15 05:15:00
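For anyone who wants to reproduce the example, the input frame (minus the two columns to be computed) can be rebuilt from the table above; `df2` is just an illustrative name:

```python
import pandas as pd

# Reconstruction of the sample input shown above, without the two
# computed columns (Score and Date_of_interest).
df2 = pd.DataFrame({
    'Datetime': pd.to_datetime([
        '1998-04-08 01:00:00', '1998-04-08 02:00:00', '1998-04-10 02:00:00',
        '1998-04-10 02:15:00', '1998-04-10 02:25:00', '1998-04-10 03:00:00',
        '1998-04-12 03:30:00', '1998-04-12 04:00:00', '1998-04-12 04:00:00',
        '1998-04-12 04:30:00', '1998-04-12 04:50:00', '1998-04-15 05:00:00',
        '1998-04-15 05:15:00', '1998-04-15 05:30:00', '1998-04-15 06:00:00',
        '1998-04-15 06:00:00']),
    'Item_id': [1, 1, 2, 2, 3, 1, 3, 4, 4, 2, 1, 4, 3, 2, 3, 3],
    'PC_id': [200] * 5 + [201] * 5 + [202] * 6,
    'Value': [35, 92, 35, 92, 92, 93, 94, 95, 26, 98,
              100, 100, 100, 100, 100, 222],
})
```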

Thanks for helping.

Update: code I am currently working on:

import pandas as pd

# State per PC_id: current score, last Datetime and Item_id seen, and
# whether the qualifying score has already been reached.
state = {}

for i, row in df_merged.iterrows():
    pc = row['PC_id']
    s = state.get(pc)
    # New PC_id, or previous row more than 24 hours old: reset to 1.
    # (pd.Timedelta, not pd.datetime.timedelta, which does not exist.)
    if s is None or row['Datetime'] - s['date'] > pd.Timedelta(days=1):
        state[pc] = {'score': 1, 'date': row['Datetime'],
                     'item': row['Item_id'], 'done': False}
        df_merged.at[i, 'Score'] = 1
        continue
    if row['Item_id'] != s['item']:
        s['score'] += 1
    s['item'] = row['Item_id']
    s['date'] = row['Datetime']
    # Write via .at so the change persists; assigning to the iterrows()
    # row only modifies a copy.
    df_merged.at[i, 'Score'] = s['score']
    if s['score'] == 2 and not s['done']:
        df_merged.at[i, 'Date_of_interest'] = row['Datetime']
        s['done'] = True
  • So what's the question and do you have any actual code? Commented Aug 28, 2016 at 22:07
  • 1st table is my working table. 2nd table is the output I want. Commented Aug 28, 2016 at 22:10
  • Do you have some code to perform the transformation and a specific problem with said code? For example, do you need help in how to drop the rows in the 1st dataframe that have NaN as Date_of_interest? Commented Aug 28, 2016 at 22:12
  • @Ilja Everila: I am trying to calculate Score and Date_of_interest and aggregate the second table (optional) as the final output. I have updated my code. Thanks. Commented Aug 28, 2016 at 22:32

1 Answer


Usually having to resort to iterative/imperative methods is a sign of trouble when working with pandas. Given the dataframe

In [111]: df2
Out[111]: 
              Datetime  Item_id  PC_id  Value
0  1998-04-08 01:00:00        1    200     35
1  1998-04-08 02:00:00        1    200     92
2  1998-04-10 02:00:00        2    200     35
3  1998-04-10 02:15:00        2    200     92
4  1998-04-10 02:25:00        3    200     92
5  1998-04-10 03:00:00        1    201     93
6  1998-04-12 03:30:00        3    201     94
7  1998-04-12 04:00:00        4    201     95
8  1998-04-12 04:00:00        4    201     26
9  1998-04-12 04:30:00        2    201     98
10 1998-04-12 04:50:00        1    202    100
11 1998-04-15 05:00:00        4    202    100
12 1998-04-15 05:15:00        3    202    100
13 1998-04-15 05:30:00        2    202    100
14 1998-04-15 06:00:00        3    202    100
15 1998-04-15 06:00:00        3    202    222

you could first group by PC_id

In [112]: the_group = df2.groupby('PC_id')

and then apply the search, using diff() to find the rows where Item_id and Datetime change appropriately

In [356]: from datetime import timedelta

In [357]: (the_group['Item_id'].diff() != 0) & \
     ...: (the_group['Datetime'].diff() <= timedelta(days=1))
Out[357]: 
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7      True
8     False
9      True
10    False
11    False
12     True
13     True
14     True
15    False
dtype: bool
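As an aside on why the first row of each group is always False: `diff()` yields `NaT` for the first row of a group, and a `NaT` compared against a `timedelta` evaluates to False, so a group's first row can never qualify on its own. A minimal sketch of that behaviour:

```python
import pandas as pd
from datetime import timedelta

# Two timestamps one hour apart.
s = pd.to_datetime(pd.Series(['1998-04-08 01:00:00',
                              '1998-04-08 02:00:00']))
d = s.diff()                   # first element is NaT
mask = d <= timedelta(days=1)  # comparing NaT yields False
print(mask.tolist())           # [False, True]
```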

and then just take the first date (first match) in each group, if any

In [341]: df2[(the_group['Item_id'].diff() != 0) &
     ...:     (the_group['Datetime'].diff() <= timedelta(days=1))]\
     ...: .groupby('PC_id').first()['Datetime'].reset_index()
Out[341]: 
   PC_id            Datetime
0    200 1998-04-10 02:25:00
1    201 1998-04-12 04:00:00
2    202 1998-04-15 05:15:00

3 Comments

  • @Ilja Everila: Wow! Very short code and quick. But I get the following error on my main dataframe: IndexError: single positional indexer is out-of-bounds. Any ideas?
  • Ah, of course, the edge cases; that's the result of a group having 0 matches and .iloc[0]. Will update the answer to take that into account.
  • @Ilja Everila: Worked like a charm. You saved me a lot of time. Thank you.
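To illustrate the edge case discussed in these comments: with the groupby/`first()` approach, a PC_id that never reaches the qualifying score simply drops out of the result, instead of raising the IndexError that `.iloc[0]` on an empty group would. A small sketch with a hypothetical frame (PC_id 300 is an assumption, not from the original data):

```python
import pandas as pd
from datetime import timedelta

# PC_id 300 has a single row, so both diffs are NaT/NaN on its first
# (and only) row and it never matches.
df = pd.DataFrame({
    'PC_id': [200, 200, 300],
    'Datetime': pd.to_datetime(['1998-04-10 02:00:00',
                                '1998-04-10 02:25:00',
                                '1998-04-12 03:00:00']),
    'Item_id': [2, 3, 1],
})
g = df.groupby('PC_id')
mask = (g['Item_id'].diff() != 0) & (g['Datetime'].diff() <= timedelta(days=1))
result = df[mask].groupby('PC_id').first()['Datetime'].reset_index()
print(result)  # PC_id 300 is simply absent instead of raising
```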
