The problem I need to solve:

To calculate the number of hours per day credited for (public) holidays or sick days, the average working hours per day of the previous three months are used (with a starting value of 8 hours per day).

The tricky part is that the calculated value of the previous month has to be factored in: if last month had a public holiday that was credited with a calculated value of 8.5 hours, those calculated hours influence the average working hours per day for that month, which in turn is used to assign working hours to the current month's holidays.

So far I have only come up with the following, which does not yet factor in the row-by-row calculation:

WITH
    const (h_target, h_extra) AS (VALUES (8.0, 20)),
    monthly_sums (c_month, d_work, d_off, h_work) AS (VALUES
        ('2018-12', 16, 5, 150.25),
        ('2019-01', 20, 3, 171.25),
        ('2019-02', 15, 5, 120.5)
    ),
    calc AS (
        SELECT
            ms.*,
            (ms.d_work + ms.d_off) AS d_total,
            (ms.h_work + ms.d_off * const.h_target) AS h_total,
            (avg((ms.h_work + ms.d_off * const.h_target) / (ms.d_work + ms.d_off))
                OVER (ORDER BY ms.c_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW))::numeric(10,2)
                AS h_off
        FROM monthly_sums AS ms
        CROSS JOIN const
    )
SELECT
    calc.c_month,
    calc.d_work,
    calc.d_off,
    calc.d_total,
    calc.h_work,
    calc.h_off,
    (d_off * lag(h_off, 1, const.h_target) OVER (ORDER BY c_month)) AS h_off_sum,
    (h_work + d_off * lag(h_off, 1, const.h_target) OVER (ORDER BY c_month)) AS h_sum
FROM calc CROSS JOIN const;

...giving the following result:

 c_month | d_work | d_off | d_total | h_work | h_off | h_off_sum | h_sum  
---------+--------+-------+---------+--------+-------+-----------+--------
 2018-12 |     16 |     5 |      21 | 150.25 |  9.06 |      40.0 | 190.25
 2019-01 |     20 |     3 |      23 | 171.25 |  8.77 |     27.18 | 198.43
 2019-02 |     15 |     5 |      20 |  120.5 |  8.52 |     43.85 | 164.35
(3 rows)

This calculates the columns that rely on the previous row's value (lag) correctly for the first and second rows, but the average hours per day (h_off) are wrong from the second row on, because I couldn't figure out how to feed the current row's value (h_sum) back into the calculation of the new h_off.

The desired result should be as follows:

 c_month | d_work | d_off | d_total | h_work | h_off | h_off_sum | h_sum  
---------+--------+-------+---------+--------+-------+-----------+--------
 2018-12 |     16 |     5 |      21 | 150.25 |  9.06 |      40.0 | 190.25
 2019-01 |     20 |     3 |      23 | 171.25 |  8.84 |     27.18 | 198.43
 2019-02 |     15 |     5 |      20 |  120.5 |  8.64 |      44.2 |  164.7
(3 rows)

...meaning h_off is used for the next month's h_off_sum and the resulting h_sum, and the h_sum values of the available months (at most three) in turn feed into the calculation of the current month's h_off (essentially avg(h_sum / d_total) over up to three months).

So the actual calculation is:

 c_month | calculation                                        | h_off
---------+----------------------------------------------------+-------
         |                                                    |  8.00 << initial
               .---------------------- uses ---------------------^
 2018-12 | ((190.25 / 21)) / 1                                |  9.06
                               .------------ uses ---------------^
 2019-01 | ((190.25 / 21) + (198.43 / 23)) / 2                |  8.84
                                               .--- uses --------^
 2019-02 | ((190.25 / 21) + (198.43 / 23) + (164.7 / 20)) / 3 |  8.64
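
The recurrence in the table above can be cross-checked outside the database. The following is only a sketch in Python, not a replacement for the SQL: the names MONTHS, round2 and rolling_h_off are made up for this illustration, and it assumes that h_off is rounded to two decimals (half up) before it is reused, while the monthly h_sum / d_total ratios are averaged unrounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# Input rows from the question: (c_month, d_work, d_off, h_work)
MONTHS = [
    ("2018-12", 16, 5, Decimal("150.25")),
    ("2019-01", 20, 3, Decimal("171.25")),
    ("2019-02", 15, 5, Decimal("120.5")),
]


def round2(value):
    """Round a Decimal to two places, halves rounded up (assumed rounding rule)."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def rolling_h_off(months, h_target=Decimal("8.0"), window=3):
    """Return (c_month, h_off, h_off_sum, h_sum) per month.

    Days off are credited with the previous month's h_off (h_target for the
    first month); h_off is the mean of h_sum / d_total over up to `window` months.
    """
    ratios = []            # unrounded h_sum / d_total of every month so far
    h_off_prev = h_target  # value used to credit this month's days off
    rows = []
    for c_month, d_work, d_off, h_work in months:
        d_total = d_work + d_off
        h_off_sum = d_off * h_off_prev
        h_sum = h_work + h_off_sum
        ratios.append(h_sum / d_total)
        recent = ratios[-window:]
        h_off = round2(sum(recent) / len(recent))
        rows.append((c_month, h_off, h_off_sum, h_sum))
        h_off_prev = h_off  # feed the result into the next month
    return rows


for row in rolling_h_off(MONTHS):
    print(*row)
```

With the three input months this yields h_off values of 9.06, 8.84 and 8.64, matching the desired result. The loop carries exactly one value (h_off_prev) from month to month, which is the dependency a plain window function cannot express.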

P.S.: I am using PostgreSQL 11, so I have the latest features at hand, if that makes any difference.

1 Answer

I wasn't able to solve this inter-column and inter-row calculation with window functions at all. I had to fall back on a recursive CTE and introduce special-purpose columns for the days (d_total_1) and hours (h_sum_1) of the third historical month, because the recursive temporary table cannot be joined in more than once.

In addition, I added a fourth row to the input data and an index column (row_num) that I can refer to when joining. With real data, such a column is usually generated by a subquery like this:

SELECT ROW_NUMBER() OVER (ORDER BY c_month) AS row_num, * FROM monthly_sums

So, here's my take on it:

WITH RECURSIVE calc AS (
        SELECT 
            monthly_sums.row_num,
            monthly_sums.c_month,
            monthly_sums.d_work,
            monthly_sums.d_off,
            monthly_sums.h_work,
            (monthly_sums.d_off * 8)::numeric(10,2) AS h_off_sum,
            monthly_sums.d_work + monthly_sums.d_off AS d_total,
            0.0 AS d_total_1,
            (monthly_sums.h_work + monthly_sums.d_off * 8)::numeric(10,2) AS h_sum,
            0.0 AS h_sum_1,
            (
                (monthly_sums.h_work + monthly_sums.d_off * 8)
                /
                (monthly_sums.d_work + monthly_sums.d_off)
            )::numeric(10,2) AS h_off
        FROM
            (
                SELECT * FROM (VALUES
                    (1, '2018-12', 16, 5, 150.25),
                    (2, '2019-01', 20, 3, 171.25),
                    (3, '2019-02', 15, 5, 120.5),
                    (4, '2019-03', 19, 2, 131.75)
                ) AS tmp (row_num, c_month, d_work, d_off, h_work)
            ) AS monthly_sums
        WHERE
            monthly_sums.row_num = 1
    UNION ALL
        SELECT
            monthly_sums.row_num,
            monthly_sums.c_month,
            monthly_sums.d_work,
            monthly_sums.d_off,
            monthly_sums.h_work,
            lat_off.h_off_sum::numeric(10,2),
            lat_days.d_total,
            calc.d_total AS d_total_1,
            lat_sum.h_sum::numeric(10,2),
            calc.h_sum AS h_sum_1,
            lat_calc.h_off::numeric(10,2)
        FROM
            (
                SELECT * FROM (VALUES
                    (1, '2018-12', 16, 5, 150.25),
                    (2, '2019-01', 20, 3, 171.25),
                    (3, '2019-02', 15, 5, 120.5),
                    (4, '2019-03', 19, 2, 131.75)
                ) AS tmp (row_num, c_month, d_work, d_off, h_work)
            ) AS monthly_sums
            INNER JOIN calc ON (calc.row_num = monthly_sums.row_num - 1),
            LATERAL (SELECT monthly_sums.d_work + monthly_sums.d_off AS d_total) AS lat_days,
            LATERAL (SELECT monthly_sums.d_off * calc.h_off AS h_off_sum) AS lat_off,
            LATERAL (SELECT monthly_sums.h_work + lat_off.h_off_sum AS h_sum) AS lat_sum,
            LATERAL (SELECT
                (calc.h_sum_1 + calc.h_sum + lat_sum.h_sum)
                /
                (calc.d_total_1 + calc.d_total + lat_days.d_total)
                AS h_off
            ) AS lat_calc
        WHERE
            monthly_sums.row_num > 1
    )
SELECT c_month, d_work, d_off, d_total, h_work, h_off, h_off_sum, h_sum FROM calc
;

...which gives:

 c_month | d_work | d_off | d_total | h_work | h_off | h_off_sum | h_sum  
---------+--------+-------+---------+--------+-------+-----------+--------
 2018-12 |     16 |     5 |      21 | 150.25 |  9.06 |     40.00 | 190.25
 2019-01 |     20 |     3 |      23 | 171.25 |  8.83 |     27.18 | 198.43
 2019-02 |     15 |     5 |      20 |  120.5 |  8.65 |     44.15 | 164.65
 2019-03 |     19 |     2 |      21 | 131.75 |  8.00 |     17.30 | 149.05
(4 rows)

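(The result differs slightly from what was initially expected. Note that this query divides the summed hours of the window by its summed days rather than averaging the monthly h_sum / d_total ratios, and that the numeric(10,2) casts round the intermediate values. The output is consistent with that formula.)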
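
To double-check the numbers above independently of the database, here is a rough Python sketch of the arithmetic the recursive query performs. It is an illustration under assumptions, not part of the solution: the names MONTHS, round2 and weighted_h_off are hypothetical, and the numeric(10,2) casts are mimicked by rounding to two decimals with halves rounded up.

```python
from decimal import Decimal, ROUND_HALF_UP

# Input rows from the query: (c_month, d_work, d_off, h_work)
MONTHS = [
    ("2018-12", 16, 5, Decimal("150.25")),
    ("2019-01", 20, 3, Decimal("171.25")),
    ("2019-02", 15, 5, Decimal("120.5")),
    ("2019-03", 19, 2, Decimal("131.75")),
]


def round2(value):
    """Mimic a numeric(10,2) cast: two decimals, halves rounded up."""
    return Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def weighted_h_off(months, h_target=Decimal("8"), window=3):
    """Return (c_month, h_off, h_off_sum, h_sum) per month.

    h_off is the sum of h_sum divided by the sum of d_total over the
    current month and up to `window - 1` preceding months.
    """
    history = []           # (h_sum, d_total) of every month so far
    h_off_prev = h_target  # credits the days off of the first month
    rows = []
    for c_month, d_work, d_off, h_work in months:
        d_total = d_work + d_off
        h_off_sum = round2(d_off * h_off_prev)
        h_sum = round2(h_work + h_off_sum)
        history.append((h_sum, d_total))
        recent = history[-window:]
        hours = sum(h for h, _ in recent)
        days = sum(d for _, d in recent)
        h_off = round2(hours / days)
        rows.append((c_month, h_off, h_off_sum, h_sum))
        h_off_prev = h_off
    return rows


for row in weighted_h_off(MONTHS):
    print(*row)
```

This reproduces the h_off column of the query output (9.06, 8.83, 8.65, 8.00). Dividing summed hours by summed days weights each month by its number of days, which is why the second and third values are not the 8.84 and 8.64 obtained by averaging the monthly ratios.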

Please note that PostgreSQL is generally pretty picky about data types and refuses to process queries like this whenever there is a discrepancy that could lead to a loss of precision (e.g. numeric vs. integer), which is why I cast the columns explicitly in both the non-recursive and the recursive part of the CTE.

One of the final pieces of the puzzle was solved by using LATERAL subqueries, which let one calculation reference the result of a previous one and also let me reorder columns in the final output independently of the calculation hierarchy.

If anyone can come up with a simpler variant I'd be happy to learn about it.
