
How can I filter queries by date to prevent a massive sequential scan on a large database?

My survey app collects responses, and each answer to a survey question is stored in a response_answer table.

When I query all response_answers for the month, I filter by date; however, Postgres runs a sequential scan over all response_answers (millions of rows), and it is slow.

The query:

explain analyse 
  select count(*)
    from response_answer
    left join response r on r.id = response_answer.response_id
    where r.date_recorded between '2019-08-01T00:00:00.000Z' and '2019-08-29T23:59:59.999Z';
QUERY PLAN
Aggregate  (cost=517661.09..517661.10 rows=1 width=8) (actual time=139362.882..139362.899 rows=1 loops=1)
  ->  Hash Join  (cost=8063.39..517565.30 rows=38316 width=0) (actual time=126512.031..136806.093 rows=316558 loops=1)
        Hash Cond: (response_answer.response_id = r.id)
        ->  Seq Scan on response_answer  (cost=0.00..480365.73 rows=7667473 width=4) (actual time=1.443..70216.817 rows=7667473 loops=1)
        ->  Hash  (cost=8053.35..8053.35 rows=803 width=4) (actual time=173.467..173.476 rows=7010 loops=1)
              Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 311kB
              ->  Seq Scan on response r  (cost=0.00..8053.35 rows=803 width=4) (actual time=0.489..107.417 rows=7010 loops=1)
                    Filter: ((date_recorded >= '2019-08-01'::date) AND (date_recorded <= '2019-08-29'::date))
                    Rows Removed by Filter: 153682
Planning time: 21.310 ms
Execution time: 139373.365 ms

I do have indexes on response_answer(response_id), response_answer(id), and response(id).

As the system grows, this query will eventually become unusably slow, because the cost of the sequential scan grows with the size of the table.

When dealing with large amounts of data, how should I design my queries and tables so that the database doesn't have to run a sequential scan of every. single. row? Surely there's a way for Postgres to only consider responses in the date range before finding all the related response_answers?

1 Answer


You need indexes on

response (date_recorded, id)

and

response_answer (response_id)

VACUUM the tables so that PostgreSQL can use an index-only scan.
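
A minimal sketch of those statements, assuming the schema from the question (the response_answer (response_id) index already exists per the question, so only the composite index is new):

-- two-column index: both the date filter and the join key
-- can be answered from the index alone
create index on response (date_recorded, id);

-- refresh the visibility map so index-only scans are possible
vacuum analyze response;
vacuum analyze response_answer;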

With a query like this, you don't need an outer join. PostgreSQL is smart enough to infer that: the WHERE condition can never be true for the NULL-extended rows an outer join would add, so it reduces the join to an inner join.
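
For example, a sketch of the same query with a plain join (date literals shortened; adjust them if date_recorded is a timestamp rather than a date):

explain analyse
  select count(*)
    from response_answer
    join response r on r.id = response_answer.response_id
    where r.date_recorded between '2019-08-01' and '2019-08-29';

With the indexes above, the planner can scan response by date first and then probe response_answer via the response_id index, instead of scanning every row.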


3 Comments

Okay, I'll try vacuuming and adding the missing index. Do you mean to just use a regular join as opposed to an outer join?
Yes. If r.id = response_answer.response_id is true, r.id cannot be NULL.
Wow, that did it. After I vacuumed both tables and added the 2 column index it brought a 28 second query down to a 2.6 second query. Thanks Laurenz!
