
My title is not great because I'm having trouble articulating my question. Basically, I have a DataFrame with transactional data consisting of a few datetime columns and a value column. I need to apply filters to the dates and sum the resulting values in a new DataFrame.

Here is a simplified version of my DataFrame df:

    Sched Week  Ship Week   Ready Week  vals
0   2021-01-04  2021-01-11  2021-01-04  10
1   2021-01-04  2021-01-11  2021-01-04  10
2   2021-01-04  2021-01-04  2021-01-04  2
3   2021-01-07  2021-01-18  2021-01-04  9
4   2021-01-12  2021-01-18  2021-01-11  1
5   2021-01-13  2021-01-11  2021-01-11  6
6   2021-01-13  2021-01-11  2021-01-11  4
7   2021-01-13  2021-01-25  2021-01-11  8
8   2021-01-15  2021-01-25  2021-01-18  4
9   2021-01-19  2021-01-25  2021-01-18  5
10  2021-01-19  2021-01-25  2021-01-18  6
11  2021-01-21  2021-01-25  2021-01-18  10
12  2021-01-21  2021-01-25  2021-01-18  6
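
For reference, the sample frame above can be built with something like this (all three week columns parsed as datetimes):

import pandas as pd

# sample data matching the table above
df = pd.DataFrame({
    'Sched Week': ['2021-01-04', '2021-01-04', '2021-01-04', '2021-01-07',
                   '2021-01-12', '2021-01-13', '2021-01-13', '2021-01-13',
                   '2021-01-15', '2021-01-19', '2021-01-19', '2021-01-21', '2021-01-21'],
    'Ship Week':  ['2021-01-11', '2021-01-11', '2021-01-04', '2021-01-18',
                   '2021-01-18', '2021-01-11', '2021-01-11', '2021-01-25',
                   '2021-01-25', '2021-01-25', '2021-01-25', '2021-01-25', '2021-01-25'],
    'Ready Week': ['2021-01-04', '2021-01-04', '2021-01-04', '2021-01-04',
                   '2021-01-11', '2021-01-11', '2021-01-11', '2021-01-11',
                   '2021-01-18', '2021-01-18', '2021-01-18', '2021-01-18', '2021-01-18'],
    'vals': [10, 10, 2, 9, 1, 6, 4, 8, 4, 5, 6, 10, 6],
})
for col in ['Sched Week', 'Ship Week', 'Ready Week']:
    df[col] = pd.to_datetime(df[col])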

The new DataFrame df_result I want to create should look like this, based on the values in df. The Sched Week column in this DataFrame is simply df['Sched Week'].unique(), and foo is the sum of df['vals'] for the rows that meet the conditions below.

    Sched Week  foo
0   2021-01-04  20
1   2021-01-07  29
2   2021-01-12  10
3   2021-01-13  18
4   2021-01-15  18
5   2021-01-19  23
6   2021-01-21  39

And here is the basic logic to generate the new DataFrame:

(df['Sched Week'] <= df_result['Sched Week']) &
(df['Ship Week'] > df_result['Sched Week']) &
(df['Ready Week'] <= df_result['Sched Week'])

This test needs to be performed for each row in the new df_result DataFrame and the values summed.

So, the 20 at index 0 is the sum of the values at index 0 and 1 from the original df, since those rows meet the conditions for 2021-01-04.
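
To make the conditions concrete, here is the kind of mask I have been trying, but only for a single week at a time:

wk = pd.Timestamp('2021-01-04')
mask = (df['Sched Week'] <= wk) & (df['Ship Week'] > wk) & (df['Ready Week'] <= wk)
df.loc[mask, 'vals'].sum()  # 20, from rows 0 and 1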

I have tried every boolean-masking and groupby approach I can think of, but nothing I've done so far has worked.

EDIT

Here is the equivalent in Excel.

The formula in cell J3 is =SUMIFS(F:F,C:C,"<="&I3,D:D,">"&I3,E:E,"<="&I3)

[Screenshot: DataFrames represented in Excel]

  • @sammywemmy I was attempting to explain the conditions with those inequalities. In Excel terms this is just a SUMIFS formula with three conditions. I also double-checked the value for 2021-01-19 and it is correct: it is the sum of the values at index 7-10, because that is where all three of the date conditions are True. Commented Sep 27, 2021 at 22:55
  • Edited my question to show the solution in Excel. Commented Sep 28, 2021 at 0:29
  • The table on the right, df_result, is the one I want to create from the one on the left, df. And df_result['Sched Week'] = df['Sched Week'].unique(). Commented Sep 28, 2021 at 1:05
  • An accurate title would be "Aggregate dataframe values by 'Scheduled Week', 'Ship Week', 'Ready Week'". Commented Mar 12, 2022 at 3:50

2 Answers


I kept digging and found a solution to my question, with a lot of help from this answer from kait.

def usr(x):
    # build a boolean mask over the original df for this df_result row's week
    mask = df['Sched Week'] <= x['Sched Week']
    mask &= df['Ship Week'] > x['Sched Week']
    mask &= df['Ready Week'] <= x['Sched Week']
    # sum the matching values and store them on the row
    x['foo'] = df[mask].vals.sum()
    return x

df_result.apply(usr, axis=1)
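
This assumes df_result has already been seeded with the unique scheduled weeks before the apply call, e.g.:

# hypothetical setup for df_result, matching the question
df_result = pd.DataFrame({'Sched Week': df['Sched Week'].unique()})
out = df_result.apply(usr, axis=1)  # keep the returned frame; it now has the 'foo' column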

1 Comment

  • Efficient as well, since all the steps within the function are vectorized.

One option is an inequality (non-equi) join to get the relevant rows, followed by a groupby to sum the values. The conditional_join function from pyjanitor offers an efficient implementation of non-equi joins - under the hood it uses binary search, which is faster than iterating through every row (a cartesian join), and the advantage becomes even more evident for large data:

# pip install pyjanitor
import janitor
import pandas as pd

df = pd.read_clipboard(sep=r'\s{2,}',
                       engine='python',
                       parse_dates=['Sched Week', 'Ship Week', 'Ready Week'])

# unique scheduled weeks to join against
dff = df['Sched Week'].drop_duplicates()

(df
.conditional_join(
    dff,
    ('Sched Week', 'Sched Week', '<='),
    ('Ship Week', 'Sched Week', '>'),
    ('Ready Week', 'Sched Week', '<='),
    how='right',
    df_columns='vals')
.groupby('Sched Week', sort=False, as_index=False)
.sum(numeric_only=True)
)
  Sched Week  vals
0 2021-01-04    20
1 2021-01-07    29
2 2021-01-12    10
3 2021-01-13    18
4 2021-01-15    18
5 2021-01-19    23
6 2021-01-21    39
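
If the column name needs to match the desired output, a final .rename(columns={'vals': 'foo'}) on the result would take care of it.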

