
My title is not great because I'm having trouble articulating my question. Basically, I have a DataFrame with transactional data consisting of a few datetime columns and a value column. I need to apply filters to the dates and sum the resulting values in a new DataFrame.

Here is a simplified version of my DataFrame df:

    Sched Week  Ship Week   Ready Week  vals
0   2021-01-04  2021-01-11  2021-01-04  10
1   2021-01-04  2021-01-11  2021-01-04  10
2   2021-01-04  2021-01-04  2021-01-04  2
3   2021-01-07  2021-01-18  2021-01-04  9
4   2021-01-12  2021-01-18  2021-01-11  1
5   2021-01-13  2021-01-11  2021-01-11  6
6   2021-01-13  2021-01-11  2021-01-11  4
7   2021-01-13  2021-01-25  2021-01-11  8
8   2021-01-15  2021-01-25  2021-01-18  4
9   2021-01-19  2021-01-25  2021-01-18  5
10  2021-01-19  2021-01-25  2021-01-18  6
11  2021-01-21  2021-01-25  2021-01-18  10
12  2021-01-21  2021-01-25  2021-01-18  6
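
For reference, the sample frame above can be built with something like this (all three week columns parsed as datetimes):

import pandas as pd

# sample data matching the table above
df = pd.DataFrame({
    'Sched Week': ['2021-01-04', '2021-01-04', '2021-01-04', '2021-01-07',
                   '2021-01-12', '2021-01-13', '2021-01-13', '2021-01-13',
                   '2021-01-15', '2021-01-19', '2021-01-19', '2021-01-21', '2021-01-21'],
    'Ship Week':  ['2021-01-11', '2021-01-11', '2021-01-04', '2021-01-18',
                   '2021-01-18', '2021-01-11', '2021-01-11', '2021-01-25',
                   '2021-01-25', '2021-01-25', '2021-01-25', '2021-01-25', '2021-01-25'],
    'Ready Week': ['2021-01-04', '2021-01-04', '2021-01-04', '2021-01-04',
                   '2021-01-11', '2021-01-11', '2021-01-11', '2021-01-11',
                   '2021-01-18', '2021-01-18', '2021-01-18', '2021-01-18', '2021-01-18'],
    'vals': [10, 10, 2, 9, 1, 6, 4, 8, 4, 5, 6, 10, 6],
})
for col in ['Sched Week', 'Ship Week', 'Ready Week']:
    df[col] = pd.to_datetime(df[col])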

The new DataFrame df_result I want to create should look like this, based on the values in df. The Sched Week column in this DataFrame is simply df['Sched Week'].unique(), and foo is the sum of df['vals'] for the rows that meet the conditions below.

    Sched Week  foo
0   2021-01-04  20
1   2021-01-07  29
2   2021-01-12  10
3   2021-01-13  18
4   2021-01-15  18
5   2021-01-19  23
6   2021-01-21  39

And here is the basic logic to generate the new DataFrame:

(df['Sched Week'] <= df_result['Sched Week']) &
(df['Ship Week'] > df_result['Sched Week']) &
(df['Ready Week'] <= df_result['Sched Week'])

This test needs to be performed for each row in the new df_result DataFrame and the values summed.

So, the 20 at index 0 is the sum of the values at index 0 and 1 from the original df, since those rows meet the conditions for 2021-01-04.
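
To make the conditions concrete, here is the kind of mask I have been trying, but only for a single week at a time:

wk = pd.Timestamp('2021-01-04')
mask = (df['Sched Week'] <= wk) & (df['Ship Week'] > wk) & (df['Ready Week'] <= wk)
df.loc[mask, 'vals'].sum()  # 20, from rows 0 and 1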

I have tried every boolean-masking and groupby approach I can think of, but nothing I've done so far has worked.

EDIT

Here is the equivalent in Excel.

The formula in cell J3 is =SUMIFS(F:F,C:C,"<="&I3,D:D,">"&I3,E:E,"<="&I3)

[Screenshot: DataFrames represented in Excel]

  • @sammywemmy I was attempting to explain the conditions with those inequalities. In Excel terms this is just a SUMIFS formula with three conditions. I also double-checked the value for 2021-01-19 and it is correct: it is the sum of the values at index 7-10, because that is where all three of the date conditions are True. Commented Sep 27, 2021 at 22:55
  • Edited my question to show the solution in Excel. Commented Sep 28, 2021 at 0:29
  • The table on the right, df_result, is the one I want to create from the one on the left, df. And df_result['Sched Week'] = df['Sched Week'].unique(). Commented Sep 28, 2021 at 1:05
  • An accurate title would be "Aggregate dataframe values by 'Scheduled Week', 'Ship Week', 'Ready Week'". Commented Mar 12, 2022 at 3:50

2 Answers


I kept digging and found a solution to my question, with a lot of help from this answer from kait.

def usr(x):
    # build a boolean mask over the original df for this df_result row's week
    mask = df['Sched Week'] <= x['Sched Week']
    mask &= df['Ship Week'] > x['Sched Week']
    mask &= df['Ready Week'] <= x['Sched Week']
    # sum the matching values and store them on the row
    x['foo'] = df[mask].vals.sum()
    return x

df_result.apply(usr, axis=1)
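
This assumes df_result has already been seeded with the unique scheduled weeks before the apply call, e.g.:

# hypothetical setup for df_result, matching the question
df_result = pd.DataFrame({'Sched Week': df['Sched Week'].unique()})
out = df_result.apply(usr, axis=1)  # keep the returned frame; it now has the 'foo' column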

1 Comment

  • Efficient as well, since all the steps within the function are vectorized.

One option is an inequality (non-equi) join to get the relevant rows, followed by a groupby to sum the values. The conditional_join function from pyjanitor offers an efficient implementation of non-equi joins - under the hood it uses binary search, which is faster than iterating through every row (a cartesian join), and the advantage becomes even more evident for large data:

# pip install pyjanitor
import janitor
import pandas as pd

df = pd.read_clipboard(sep=r'\s{2,}',
                       engine='python',
                       parse_dates=['Sched Week', 'Ship Week', 'Ready Week'])

# unique scheduled weeks to join against
dff = df['Sched Week'].drop_duplicates()

(df
.conditional_join(
    dff,
    ('Sched Week', 'Sched Week', '<='),
    ('Ship Week', 'Sched Week', '>'),
    ('Ready Week', 'Sched Week', '<='),
    how='right',
    df_columns='vals')
.groupby('Sched Week', sort=False, as_index=False)
.sum(numeric_only=True)
)
  Sched Week  vals
0 2021-01-04    20
1 2021-01-07    29
2 2021-01-12    10
3 2021-01-13    18
4 2021-01-15    18
5 2021-01-19    23
6 2021-01-21    39
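
If the column name needs to match the desired output, a final .rename(columns={'vals': 'foo'}) on the result would take care of it.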

