
I have 3 tables:

create table cart (
  id       bigserial primary key,
  buyer_id bigint unique not null
);


create table contact_person (
  id           bigserial primary key,
  cart_id      bigint references cart (id) not null unique,
  phone_number jsonb,
  first_name   VARCHAR,
  middle_name  VARCHAR,
  last_name    VARCHAR
);

create table cart_items (
  id      bigserial primary key,
  item_id bigint                      not null,
  cart_id bigint references cart (id) not null,
  count   int                         not null,
  unique (item_id, cart_id)
);

cart and contact_person are related 1:1; cart and cart_items are related 1:N.

I want to aggregate all cart_items fields by cart id. There are two options:

1) Aggregate before join:

select c.id       as id,
       c.buyer_id as buyer_id,
       cp.id      as contact_id,
       cp.phone_number,
       cp.first_name,
       cp.middle_name,
       cp.last_name,
       ci.ids, ci.item_ids, ci.counts
from cart c
       inner join contact_person cp on c.id = cp.cart_id
       left join (select cart_id,
                         array_agg(id)      as ids,
                         array_agg(item_id) as item_ids,
                         array_agg(count)   as counts
                  from cart_items
                  group by cart_id) ci on ci.cart_id = c.id
where c.buyer_id = :buyerId;

2) Aggregate after join:

select c.id                  as id,
       c.buyer_id            as buyer_id,
       cp.id                 as contact_id,
       cp.phone_number,
       cp.first_name,
       cp.middle_name,
       cp.last_name,
       array_agg(ci.id)      as ids,
       array_agg(ci.item_id) as item_ids,
       array_agg(ci.count)   as counts
from cart c
       inner join contact_person cp on c.id = cp.cart_id
       left join cart_items ci on ci.cart_id = c.id
where c.buyer_id = :buyerId
group by c.id, cp.id;

As EXPLAIN shows, the query with aggregation after the join is much faster. The query plans are really different, but I cannot explain why the aggregate-before variant has such a high cost.
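A side note on reading these numbers: the figures below are the planner's cost estimates, not measured times. To confirm which query is actually faster, one option is to run the queries under `EXPLAIN (ANALYZE, BUFFERS)`, which executes them and reports real timings. This is a sketch for the aggregate-before variant, with a concrete buyer id of 1 substituted for the `:buyerId` placeholder:

```sql
-- ANALYZE actually executes the query and reports real timings and
-- row counts, so planner misestimates (e.g. rows=1450 vs. the true
-- count) become visible; BUFFERS adds I/O statistics.
explain (analyze, buffers)
select c.id, c.buyer_id, ci.ids, ci.item_ids, ci.counts
from cart c
       left join (select cart_id,
                         array_agg(id)      as ids,
                         array_agg(item_id) as item_ids,
                         array_agg(count)   as counts
                  from cart_items
                  group by cart_id) ci on ci.cart_id = c.id
where c.buyer_id = 1;
```

Comparing `rows=` against `actual rows=` in that output shows where the estimates driving the cost figures below go wrong.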

1) aggregate before:

Nested Loop  (cost=108.97..141.16 rows=1 width=248)
  ->  Merge Left Join  (cost=108.82..132.96 rows=1 width=112)
        Merge Cond: (c.id = ci.cart_id)
        ->  Sort  (cost=8.18..8.19 rows=1 width=16)
              Sort Key: c.id
              ->  Index Scan using cart_buyer_id_key on cart c  (cost=0.15..8.17 rows=1 width=16)
                    Index Cond: (buyer_id = 1)
        ->  GroupAggregate  (cost=100.64..122.26 rows=200 width=104)
              Group Key: ci.cart_id
              ->  Sort  (cost=100.64..104.26 rows=1450 width=28)
                    Sort Key: ci.cart_id
                    ->  Seq Scan on cart_items ci  (cost=0.00..24.50 rows=1450 width=28)
  ->  Index Scan using contact_person_cart_id_key on contact_person cp  (cost=0.15..8.17 rows=1 width=144)
        Index Cond: (cart_id = c.id)

2) aggregate after:

GroupAggregate  (cost=41.62..41.66 rows=1 width=248)
  Group Key: c.id, cp.id
  ->  Sort  (cost=41.62..41.63 rows=1 width=172)
        Sort Key: c.id, cp.id
        ->  Nested Loop Left Join  (cost=15.33..41.61 rows=1 width=172)
              ->  Nested Loop  (cost=0.30..16.37 rows=1 width=152)
                    ->  Index Scan using cart_buyer_id_key on cart c  (cost=0.15..8.17 rows=1 width=16)
                          Index Cond: (buyer_id = 1)
                    ->  Index Scan using contact_person_cart_id_key on contact_person cp  (cost=0.15..8.17 rows=1 width=144)
                          Index Cond: (cart_id = c.id)
              ->  Bitmap Heap Scan on cart_items ci  (cost=15.03..25.17 rows=7 width=28)
                    Recheck Cond: (cart_id = c.id)
                    ->  Bitmap Index Scan on cart_items_item_id_cart_id_key  (cost=0.00..15.03 rows=7 width=0)
                          Index Cond: (cart_id = c.id)

I thought of adding an index on the cart_id field of cart_items. That effectively accelerated the queries, in the first case as well as in the second. How can this difference be explained?
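For reference, the supporting index described above might look like this (the index name is illustrative):

```sql
-- Lets Postgres fetch the items of a single cart directly, instead of
-- seq-scanning and sorting all of cart_items before aggregating.
create index cart_items_cart_id_idx on cart_items (cart_id);
```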

1 Comment
[as you found out yourself] There is no supporting index for the FK cart_items.cart_id --> cart.id (this probably causes the need for a sort step). Note: both queries are relatively small; cost-based planning does not work well for small numbers. Commented Aug 27, 2018 at 15:54

1 Answer

Think of it this way: in your "before" example, you're joining a table to an "on the fly" view, which has to be generated BEFORE it can be joined.

In your "after" example, you're joining two tables and then aggregating. The join itself is faster, and no intermediate result has to be created, sorted, etc. Aggregating data AFTER you've collected it all should be faster when you're not eliminating any rows, and the join is much simpler anyway.
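As a sketch of a third variant sometimes used in this situation (not from the question itself): a LATERAL subquery aggregates only the items of each matched cart, combining the per-cart aggregation of option 1 with the indexed per-cart lookup that makes option 2 fast:

```sql
select c.id, c.buyer_id,
       cp.id as contact_id,
       cp.phone_number, cp.first_name, cp.middle_name, cp.last_name,
       ci.ids, ci.item_ids, ci.counts
from cart c
       inner join contact_person cp on c.id = cp.cart_id
       -- The lateral subquery references c.id, so it aggregates only the
       -- current cart's rows; an aggregate without GROUP BY always yields
       -- exactly one row, hence the "on true" join condition.
       left join lateral (select array_agg(id)      as ids,
                                 array_agg(item_id) as item_ids,
                                 array_agg(count)   as counts
                          from cart_items
                          where cart_id = c.id) ci on true
where c.buyer_id = :buyerId;
```

With an index on cart_items (cart_id), this avoids scanning and sorting the whole cart_items table while still keeping the aggregation out of the outer join.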


2 Comments

Sounds logical. But as the number of joins grows, the option with aggregation before the join becomes much faster for some reason. For example, I once asked a question where I used aggregation after the join, and the answer offered a variant with aggregation before the join, which turned out faster. And this gap in speed widened as the number of joins grew. [stackoverflow.com/questions/51825480/…
I'll see if I can replicate it. Planners can do weird magic, which is why I've always been a fan of Oracle's "hinting" system for telling the optimizer what you want done in certain scenarios. Postgres has fought that hard, and usually you just adjust the syntax until the optimizer does what you want. But some of the things involved get VERY complex, because once several tables are involved there are lots of statistics to consider.
