
I have three large-ish tables: A (10 million rows), A_C (60 million rows) and C (6 rows). They are related via bigint columns with primary/foreign key constraints. A and C have an n:m relationship, which is why I added the junction table A_C. The schema is as follows:

A.id = A_C.a_id and C.id = A_C.c_id

A is the primary table here. I want to display the first 100 entries from A and all corresponding rows from C via A_C. Currently the query takes very long. My first approach was two simple joins (sketched below). Running EXPLAIN ANALYZE I found that PostgreSQL does a sequential scan on A. I think this is because it applies the LIMIT last and effectively disregards it during the join, which performs very poorly. Is there any way to work around this, or another solution I am missing? When I put a LIMIT on a simple SELECT, it runs fine in ~2 ms.
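
Roughly, that two-join attempt looked like this (a sketch of the intent; the exact selected columns are an assumption):

SELECT A.id, C.id
FROM A
JOIN A_C ON A_C.a_id = A.id
JOIN C   ON C.id = A_C.c_id
LIMIT 100;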

Edit: A.id is the primary key and has a btree index. A_C has a unique constraint on the entire row and also indexes on A_C.a_id and A_C.c_id. I have not joined C yet because the query is already really slow without it.
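
For context, that layout corresponds roughly to the following DDL (only a sketch reconstructed from the description above; the primary key on C and the unnamed indexes are assumptions):

CREATE TABLE A   (id bigint PRIMARY KEY);   -- the PK is backed by a btree index
CREATE TABLE C   (id bigint PRIMARY KEY);   -- a PK here is an assumption (see the answer below)
CREATE TABLE A_C (
    a_id bigint NOT NULL REFERENCES A (id),
    c_id bigint NOT NULL REFERENCES C (id),
    UNIQUE (a_id, c_id)                     -- "unique constraint on the entire row"
);
CREATE INDEX ON A_C (a_id);                 -- separate indexes on the join columns
CREATE INDEX ON A_C (c_id);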

Fast queries:

  • SELECT A.id FROM A
  • SELECT A.id FROM A LIMIT 10

Slow query:

The following query is fast to plan but slow to execute because the amount of data involved is very large:

  • SELECT A.id FROM A JOIN A_C ON A.id = A_C.a_id

I tried to avoid that by setting a limit on how many rows from A are retrieved:

SELECT a.id FROM (SELECT A.id AS id FROM A LIMIT 100) a JOIN A_C ON (a.id = A_C.a_id)

The query plan via EXPLAIN ANALYZE looks as follows:

Hash Join  (cost=5.75..1305897.53 rows=44790854 width=8) (actual time=251969.648..251969.648 rows=0 loops=1)
  Hash Cond: (a_c.a_id = a.id)
  ->  Seq Scan on a_c  (cost=0.00..690017.54 rows=44790854 width=8) (actual time=0.472..126203.253 rows=44790000 loops=1)
  ->  Hash  (cost=4.50..4.50 rows=100 width=8) (actual time=1.889..1.889 rows=100 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Limit  (cost=0.00..3.50 rows=100 width=8) (actual time=0.527..1.550 rows=100 loops=1)
              ->  Seq Scan on a  (cost=0.00..261275.88 rows=7465088 width=8) (actual time=0.520..0.912 rows=100 loops=1)
Planning time: 16.350 ms
Execution time: 251969.693 ms
  • What indexes do your tables have? And what actual query did you run? Commented Aug 7, 2017 at 1:37
  • Updated question. Commented Aug 7, 2017 at 1:39
  • Without seeing the actual query, I don't think anyone, even Gordon Linoff, could give an answer here. Note that if you were joining A to A_C and there were no WHERE clause, then a sequential scan of A might not be avoidable. Commented Aug 7, 2017 at 1:45
  • wiki.postgresql.org/wiki/Slow_Query_Questions Commented Aug 7, 2017 at 1:51

1 Answer


I would suggest a subquery for the rows from a:

select . . .
from (select a.*
      from a
      order by a.id
      fetch first 100 rows only    -- or limit 100
     ) a join
     a_c ac
     on ac.a_id = a.id join
     c
     on ac.c_id = c.id;

Even though c is really small, I would encourage a primary key on c(id) if you don't already have one.
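
If it is missing, adding it is a one-liner (a sketch, assuming there is no existing constraint on c(id)):

ALTER TABLE c ADD PRIMARY KEY (id);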

This assumes that "the first hundred rows" means by the ordering of the primary key.
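
To confirm the effect, you can run the rewritten query through EXPLAIN ANALYZE just like the original one (the selected columns below are only an illustration):

EXPLAIN ANALYZE
SELECT a.id, ac.c_id
FROM (SELECT a.id
      FROM a
      ORDER BY a.id
      FETCH FIRST 100 ROWS ONLY   -- or LIMIT 100
     ) a
JOIN a_c ac ON ac.a_id = a.id
JOIN c ON ac.c_id = c.id;

With the index on a_c.a_id in place, the planner can pull the 100 ids from an index scan on a's primary key and then probe a_c with index lookups, instead of the full sequential scan over a_c that dominates the plan in the question.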


4 Comments

Is fetch first 10 rows only different from limit 10?
@Freddy . . . Only in being the ANSI-compliant version.
Thanks. I will try that out tomorrow and see if it makes a difference.
Works perfectly!
