2

Here is the query that I am trying,

SELECT s.id, s.name AS name,
CASE WHEN (ARRAY_AGG(tgs.status) @> '{Hard} ') THEN 'Hard'
WHEN (ARRAY_AGG(tgs.status) @> '{Soft} ') THEN 'Soft'
WHEN (ARRAY_AGG(tgs.status) @> '{Fluid} ') THEN 'Fluid'
WHEN (ARRAY_AGG(tgs.status) @> '{Gummy} ') THEN 'Gummy'
WHEN (ARRAY_AGG(tgs.status) @> '{Expired} ') THEN 'Expired'
END AS status, 
COUNT(*) OVER()
FROM sweets AS s 
INNER JOIN tasty_goofy_sweets AS tgs on tgs.sweet_id = s.id
GROUP BY s.id;

While implementing this my friend suggested that, instead of computing array_agg every time in the switch case, we could use LEFT JOIN LATERAL and compute it just once. i.e) to implement like below

SELECT s.id, s.name AS name,
CASE WHEN (tgs.status @> '{Hard} ') THEN 'Hard'
WHEN (tgs.arr_status @> '{Soft} ') THEN 'Soft'
WHEN (tgs.arr_status @> '{Fluid} ') THEN 'Fluid'
WHEN (tgs.arr_status @> '{Gummy} ') THEN 'Gummy'
WHEN (tgs.arr_status @> '{Expired} ') THEN 'Expired'
END AS status, 
COUNT(*) OVER()
FROM sweets AS s 
LEFT JOIN LATERAL ( SELECT ARRAY_AGG(tgs.status) AS arr_status FROM tasty_goofy_sweets tgs WHERE  tgs.sweet_id = s.id
) AS tgs ON TRUE
GROUP BY s.id;

But I am not sure if Postgres computes the ARRAY_AGG value every time, how can we decide which approach is better? I tried looking at explain analyse for both the queries, the number of rows involved in the latter query is more than the former. But I don't understand why this is so?

I intuitively feel the former approach is better, but can someone please reason out, which is better and why or am I asking too much ?

2 Answers 2

2

Most likely, Postgres will optimize away the multiple array_agg()s, compute it only once and reuse the results in each comparison. That's quite simple query optimization, that the database should easily spot.

Let me suggest, however, to simplify the query by using conditional aggreagation. Yo don't neeed to aggregate into an array just to check if a given value is there:

select
    s.id,
    s.name
    case 
        when count(*) filter(where status = 'Hard')    > 0 then 'Hard',
        when count(*) filter(where status = 'Soft')    > 0 then 'Soft',
        when count(*) filter(where status = 'Fluid')   > 0 then 'Fluid'
        when count(*) filter(where status = 'Gummy')   > 0 then 'Gummy',
        when count(*) filter(where status = 'Expired') > 0 then 'Expired'
    end status,
    count(*) over() cnt
from sweets s
inner join tasty_goofy_sweets AS tgs on tgs.sweet_id = s.id
group by s.id;

You could also express this without aggregation, using a lateral join and a conditional sort:

select
    s.id,
    s.name,
    tgs.status,
    count(*) over() cnt
from sweets s
cross join lateral (
    select status
    from tasty_goofy_sweets as tgs 
    where tgs.sweet_id = s.id
    order by case status 
        when 'Hard'    then 1
        when 'Soft'    then 2
        when 'Fluid'   then 3
        when 'Gummy'   then 4
        when 'Expired' then 5
    end
    limit 1
) tgs
Sign up to request clarification or add additional context in comments.

5 Comments

oh! does array_aggregation slow down the query
@LunaLovegood: yes the array aggregation does not seem optimal. I suspect that the fastest method is the lateral join - but you would need to assess that against your real dataset.
@GMB . . . Because of the sequential ordering of case expressions, I would be mildly surprised if the expression is evaluated only once.
@GMB in the second approach that you suggested, why did you go with CROSS JOIN, wont LEFT JOIN be more appropriate given that we have the foreign key relation of tgs.sweet_id = s.id?
@GMB also in the first approach won't the count(*) get calculated for every case statement?
1

I am fairly certain that in a case expression, the when clause is going to be evaluated separately for each condition. That means that your colleague is correct . . . probably.

The operative part of the documentation is:

Each condition is an expression that returns a boolean result. If the condition's result is true, the value of the CASE expression is the result that follows the condition, and the remainder of the CASE expression is not processed.

It is possible that Postgres does do some sort of optimization of subexpressions by evaluating them once. However, the statement: "the remainder of the CASE expression is not processed" is pretty strong and suggests that each clause will only be processed after the previous ones have evaluated to false (or NULL).

Regardless of whether the optimizer picks figures out that a function can be called only once, the lateral join guarantees that it will be evaluated once, so that seems like the safer solution for an expensive operation.

3 Comments

The query with LEFT LATERAL seems to be a bit slower and affect more rows 🤔 Reference: explain.depesz.com/s/xVta The query without LEFT LATERAL explain.depesz.com/s/PZkX Perhaps it varies depending on data. 🤔
@LunaLovegood . . . Do you have an index on tgs.sweet_id?
@LunaLovegood . . . Duh. The equivalent query would use an inner join, not a left join. Either use left join for both or use inner join for both. Also, remove the group by when using a lateral join. It is unnecessary -- but Postgres probably does not optimize it away.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.