4

Working on a problem, I have the following dataframe in python

    week    hour    week_hr     store_code  baskets
0   201616  106     201616106   505         0
1   201616  107     201616107   505         0
2   201616  108     201616108   505         0
3   201616  109     201616109   505         18
4   201616  110     201616110   505         0
5   201616  106     201616108   910         0
6   201616  107     201616106   910         0
7   201616  108     201616107   910         2
8   201616  109     201616108   910         3
9   201616  110     201616109   910         10

Here "hour" variable is a concat of "weekday" and "hour of shop", example weekday is monday=1 and hour of shop is 6am then hour variable = 106, similarly cal_hr is a concat of week and hour. I want to get those rows where i see a trend of no baskets , i.e 0 baskets for rolling 3 weeks. in the above case i will only get the first 3 rows. i.e. for store 505 there is a continuous cycle of 1 baskets from 106 to 108. But i do not want the rows (4,5,6) because even though there are 0 baskets for 3 continuous hours but the hours are actually NOT continuous. 110 -> 106 -> 107 . For the hours to be continuous they should lie in the range of 106 - 110.. Essentially i want all stores and the respective rows if it has 0 baskets for continuous 3 hours on any given day. Dummy output

    week    hour    week_hr     store_code  baskets
0   201616  106     201616106   505         0
1   201616  107     201616107   505         0
2   201616  108     201616108   505         0

Can i do this in python using pandas and loops? The dataset requires sorting by store and hour. Completely new to python (

2
  • Additional info, can i loop in the following way, loop through each hour, but for each hour check for the next 3 hours, if the baskets in those hours are 0 , then return those rows, Commented Jul 22, 2016 at 18:29
  • I would suggest first getting a list with all element that have baskets = 0. (create a dictionnary with store_code as keys, and list of hours where baskets = 0 as value. Then, find any three or more elements in a row for the list corresponding to each store_code, and finally using the list of 'good hours & store_code' to get the full data from the db. I'll try to do a more detailed answer Commented Jul 22, 2016 at 18:30

2 Answers 2

1

Do the following:

  1. Sort by store_code, week_hr
  2. Filter by 0
  3. Store the subtraction between df['week_hr'][1:].values-df['week_hr'][:-1].values so you will get to know if they are continuos.
  4. Now you can give groups to continuous and filter as you want.

    import numpy as np
    import pandas as pd
    
    # 1
    t1 = df.sort_values(['store_code', 'week_hr'])
    
    # 2
    t2 = t1[t1['baskets'] == 0]
    
    # 3
    continuous = t2['week_hr'][1:].values-t2['week_hr'][:-1].values == 1
    groups = np.cumsum(np.hstack([False, continuous==False]))
    t2['groups'] = groups
    
    # 4
    t3 = t2.groupby(['store_code', 'groups'], as_index=False)['week_hr'].count()
    t4 = t3[t3.week_hr > 2]
    print pd.merge(t2, t4[['store_code', 'groups']])
    

There's no need for looping!

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks a lot. Will the above code look for pattern of 3 or more consecutive occurences of 0. Can i create an index for each week day and hour. Example . For week 201616 and hour of monday 106 and so on.i can create a variable that has index. So the sequence become actual number. Say for 201616106 value is 1 then 201616107 value is 2..and so on..or may be i dont even need to do it? Can you please explain in detail. Still learning python. Newbie here
The code I provided will work for that dataframe you specified and will detect 3 or more consecutive 0's for the same store_code, week_hr. The relevant code is t4 = t3[t3.week_hr > 2] where week_hr is the count of consecutive occurrences (so it is filtering more than 2 consecutive occurrences). There's no need to create index, the week_hr will work just fine with groupby clause. Have you run it? Did it work?
this has worked great , just that i am unable to understand the part after #3. What does hstack do ? Can you please explain. Thanks a lot for your help
When you subtract next-current your actual array will have a -1 length. Ex: 4,5,6,7 will lead to True,True,True, so I have to hstack to add the first element as False to be able to add a column with the same length.
0

You can solve:

  1. Sort by store_code, week_hr
  2. Filter by 0
  3. Group by store_code
  4. Find continuous

Code:

t1 = df.sort_values(['store_code', 'week_hr'])

t2 = t1[t1['baskets'] == 0]

grouped = t2.groupby('store_code')['week_hr'].apply(lambda x: x.tolist())    

for store_code, week_hrs in grouped.iteritems():
    print(store_code, week_hrs)
    # do something

1 Comment

Thanks a lot cuong. Will the above code look for pattern of 3 or more consecutive occurences of 0. What does apply do ? What is tolist().what is iteritems? Can you please explain in detail. Still learning python. Newbie here.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.