
I have three large-ish tables: A (10 million rows), A_C (60 million rows) and C (6 rows). They are related via bigint columns with primary/foreign key constraints. A and C have an n:m relationship, which is why I added the junction table A_C. The schema is as follows:

A.id = A_C.a_id and C.id = A_C.c_id

A is the primary table here. I want to display the first 100 entries from A and all corresponding rows from C via A_C. Currently the query takes very long. My first approach was two simple joins (sketched below). Running EXPLAIN ANALYZE I found that PostgreSQL does a sequential scan on A. I think this is because it applies the LIMIT last and effectively disregards it during the join, which performs very poorly. Is there any way to work around this, or another solution I am missing? When I put a LIMIT on a simple SELECT, it runs fine in ~2 ms.
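
Roughly, that two-join attempt looked like this (a sketch of the intent; the exact selected columns are an assumption):

SELECT A.id, C.id
FROM A
JOIN A_C ON A_C.a_id = A.id
JOIN C   ON C.id = A_C.c_id
LIMIT 100;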

Edit: A.id is the primary key and has a btree index. A_C has a unique constraint on the entire row and also indexes on A_C.a_id and A_C.c_id. I have not joined C yet because the query is already really slow without it.
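
For context, that layout corresponds roughly to the following DDL (only a sketch reconstructed from the description above; the primary key on C and the unnamed indexes are assumptions):

CREATE TABLE A   (id bigint PRIMARY KEY);   -- the PK is backed by a btree index
CREATE TABLE C   (id bigint PRIMARY KEY);   -- a PK here is an assumption (see the answer below)
CREATE TABLE A_C (
    a_id bigint NOT NULL REFERENCES A (id),
    c_id bigint NOT NULL REFERENCES C (id),
    UNIQUE (a_id, c_id)                     -- "unique constraint on the entire row"
);
CREATE INDEX ON A_C (a_id);                 -- separate indexes on the join columns
CREATE INDEX ON A_C (c_id);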

Fast queries:

  • SELECT A.id FROM A
  • SELECT A.id FROM A LIMIT 10

Slow query:

The following query is fast to plan but slow to execute because the amount of data involved is very large:

  • SELECT A.id FROM A JOIN A_C ON A.id = A_C.a_id

I tried to avoid that by setting a limit on how many rows from A are retrieved:

SELECT a.id FROM (SELECT A.id AS id FROM A LIMIT 100) a JOIN A_C ON (a.id = A_C.a_id)

The query plan via EXPLAIN ANALYZE looks as follows:

Hash Join  (cost=5.75..1305897.53 rows=44790854 width=8) (actual time=251969.648..251969.648 rows=0 loops=1)
  Hash Cond: (a_c.a_id = a.id)
  ->  Seq Scan on a_c  (cost=0.00..690017.54 rows=44790854 width=8) (actual time=0.472..126203.253 rows=44790000 loops=1)
  ->  Hash  (cost=4.50..4.50 rows=100 width=8) (actual time=1.889..1.889 rows=100 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Limit  (cost=0.00..3.50 rows=100 width=8) (actual time=0.527..1.550 rows=100 loops=1)
              ->  Seq Scan on a  (cost=0.00..261275.88 rows=7465088 width=8) (actual time=0.520..0.912 rows=100 loops=1)
Planning time: 16.350 ms
Execution time: 251969.693 ms
  • What indexes do your tables have? And what actual query did you run? Commented Aug 7, 2017 at 1:37
  • Updated question. Commented Aug 7, 2017 at 1:39
  • Without seeing the actual query, I don't think anyone, even Gordon Linoff, could give an answer here. Note that if you were joining A to A_C and there were no WHERE clause, then a sequential scan of A might not be avoidable. Commented Aug 7, 2017 at 1:45
  • wiki.postgresql.org/wiki/Slow_Query_Questions Commented Aug 7, 2017 at 1:51

1 Answer


I would suggest a subquery for the rows from a:

select . . .
from (select a.*
      from a
      order by a.id
      fetch first 100 rows only    -- or limit 100
     ) a join
     a_c ac
     on ac.a_id = a.id join
     c
     on ac.c_id = c.id;

Even though c is really small, I would encourage a primary key on c(id) if you don't already have one.
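
If it is missing, adding it is a one-liner (a sketch, assuming there is no existing constraint on c(id)):

ALTER TABLE c ADD PRIMARY KEY (id);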

This assumes that "the first hundred rows" means by the ordering of the primary key.
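
To confirm the effect, you can run the rewritten query through EXPLAIN ANALYZE just like the original one (the selected columns below are only an illustration):

EXPLAIN ANALYZE
SELECT a.id, ac.c_id
FROM (SELECT a.id
      FROM a
      ORDER BY a.id
      FETCH FIRST 100 ROWS ONLY   -- or LIMIT 100
     ) a
JOIN a_c ac ON ac.a_id = a.id
JOIN c ON ac.c_id = c.id;

With the index on a_c.a_id in place, the planner can pull the 100 ids from an index scan on a's primary key and then probe a_c with index lookups, instead of the full sequential scan over a_c that dominates the plan in the question.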


4 Comments

Is fetch first 10 rows only different from limit 10?
@Freddy . . . Only in being the ANSI-compliant version.
Thanks. I will try that out tomorrow and see if it makes a difference.
Works perfectly!
