
Postgres uses a Nested Loop Join when I use a non-equi join condition in my update query. I understand that a Nested Loop Join can be very costly, as the right relation is scanned once for every row found in the left relation, as per [https://www.postgresql.org/docs/8.3/planner-optimizer.html].

The update query and the execution plan are below.

Query

explain analyze
UPDATE target_tbl tgt
set descr     = stage.descr,
    prod_name = stage.prod_name,
    item_name = stage.item_name,
    url       = stage.url,
    col1_name = stage.col1_name,
    col2_name = stage.col2_name,
    col3_name = stage.col3_name,
    col4_name = stage.col4_name,
    col5_name = stage.col5_name,
    col6_name = stage.col6_name,
    col7_name = stage.col7_name,
    col8_name = stage.col8_name,
    flag      = stage.flag
from tbl1 stage
where tgt.col1 = stage.col1
  and tgt.col2 = stage.col2
  and coalesce(tgt.col3, 'col3'::text) = coalesce(stage.col3, 'col3'::text)
  and coalesce(tgt.col4, 'col4'::text) = coalesce(stage.col4, 'col4'::text)
  and stage.row_number::int >= 1::int
  and stage.row_number::int < 50001::int;

Execution Plan

Update on target_tbl tgt  (cost=0.56..3557.91 rows=1 width=813) (actual time=346153.460..346153.460 rows=0 loops=1)
  ->  Nested Loop  (cost=0.56..3557.91 rows=1 width=813) (actual time=4.326..163876.029 rows=50000 loops=1)
        ->  Seq Scan on tbl1 stage  (cost=0.00..2680.96 rows=102 width=759) (actual time=3.060..2588.745 rows=50000 loops=1)
              Filter: (((row_number)::integer >= 1) AND ((row_number)::integer < 50001))
        ->  Index Scan using tbl_idx on target_tbl tgt  (cost=0.56..8.59 rows=1 width=134) (actual time=3.152..3.212 rows=1 loops=50000)
              Index Cond: ((col1 = stage.col1) AND (col2 = stage.col2) AND (COALESCE(col3, 'col3'::text) = COALESCE(stage.col3, 'col3'::text)) AND (COALESCE(col4, 'col4'::text) = COALESCE(stage.col4, 'col4'::text)))
Planning time: 17.700 ms
Execution time: 346157.168 ms
  1. Is there any way to avoid the nested loop join during the execution of the above query?

  2. Or is there a way to reduce the cost of the nested loop scan? It currently takes 6-7 minutes to update just 50000 records.

2 Comments

  • Do you have an index on tbl(col1, col2)? Commented Nov 2, 2020 at 15:27
  • No @a_horse_with_no_name. Currently I only have an index on target_tbl on (col1, col2, COALESCE(col3, 'col3'::text), COALESCE(col4, 'col4'::text)). Commented Nov 2, 2020 at 15:35

2 Answers


PostgreSQL can choose a different join strategy in that case. The reason it doesn't here is the gross misestimate in the sequential scan: 102 estimated rows instead of the actual 50000.

Fix that problem, and things will get better:

ANALYZE tbl1;

If that is not enough, collect more detailed statistics:

ALTER TABLE tbl1 ALTER row_number SET STATISTICS 1000;
ANALYZE tbl1;
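
To verify that this worked, re-run the estimate for the range filter on its own; after ANALYZE, the planner should estimate something close to 50000 rows rather than 102. A minimal sketch, using the table and column names from the question:

EXPLAIN
SELECT *
FROM tbl1 stage
WHERE stage.row_number::int >= 1
  AND stage.row_number::int < 50001;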

All this assumes that row_number is an integer and the type cast is redundant. If you made the mistake of using a different data type, an index on the expression is your only hope (the subsequent ANALYZE then also gathers statistics on the indexed expression, which is what fixes the estimate):

CREATE INDEX ON tbl1 ((row_number::integer));
ANALYZE tbl1;

1 Comment

Thanks a lot @Laurenz Albe. Creating the index on the row_number column has reduced the time to 2 mins.

I understand that a Nested Loop Join can be very costly, as the right relation is scanned once for every row found in the left relation

But the "right relation" here is an index scan, not a scan of the full table.

You can get it to stop using the index by changing the leading column of the join condition to something like where tgt.col1+0 = stage.col1 .... Upon doing that, it will probably change to a hash join or a merge join, but you will have to try it and see whether it does, and the new plan may not actually be faster. (Fixing the estimation problem, as in the other answer, would be preferable if that works.) A sketch of the rewrite follows.
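
For illustration, this is roughly what the rewrite would look like (a sketch: it assumes col1 is numeric, so that +0 is a valid no-op; for a text column the analogous no-op would be tgt.col1 || ''). Only the WHERE clause changes; the SET list is elided here and stays as in the original query:

explain analyze
UPDATE target_tbl tgt
set descr = stage.descr  -- ... remaining SET assignments unchanged
from tbl1 stage
where tgt.col1+0 = stage.col1  -- +0 keeps the planner from using the index on col1
  and tgt.col2 = stage.col2
  and coalesce(tgt.col3, 'col3'::text) = coalesce(stage.col3, 'col3'::text)
  and coalesce(tgt.col4, 'col4'::text) = coalesce(stage.col4, 'col4'::text)
  and stage.row_number::int >= 1
  and stage.row_number::int < 50001;

Another way to experiment is the standard planner setting enable_nestloop: running SET enable_nestloop = off; in the session before the EXPLAIN makes the planner avoid nested loops wherever another join strategy is possible (RESET enable_nestloop; restores the default).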

Or is there a way to reduce the cost of the nested loop scan? It currently takes 6-7 minutes to update just 50000 records.

Your plan shows that over half the time is spent on the update itself (346 s total versus 164 s in the nested loop), so reducing the cost of just the nested loop scan can have only a muted impact on the overall time. Do you have a lot of indexes on the table? The maintenance of those indexes might be a major bottleneck; you can list them as shown below.
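
A quick way to check is the standard catalog view pg_indexes (a sketch; target_tbl is the table name from the question):

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'target_tbl';

Each index listed there has to be maintained for every one of the 50000 updated rows, unless the update qualifies as HOT, which requires that no indexed column changes.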

1 Comment

Thanks a lot for your help @jjanes, you are correct: there are a lot of indexes on the target table, but unfortunately I don't have permission to drop or rebuild them.
