
How can I filter queries by date to prevent a massive sequential scan on a large database?

My survey app collects responses, and each answer to a survey question is stored in a response_answer table.

When I query all response_answers for the month, I filter by date; however, Postgres runs a sequential scan over all response_answers (millions of rows), and it is slow.

The query:

explain analyse 
  select count(*)
    from response_answer
    left join response r on r.id = response_answer.response_id
    where r.date_recorded between '2019-08-01T00:00:00.000Z' and '2019-08-29T23:59:59.999Z';
QUERY PLAN
Aggregate  (cost=517661.09..517661.10 rows=1 width=8) (actual time=139362.882..139362.899 rows=1 loops=1)
  ->  Hash Join  (cost=8063.39..517565.30 rows=38316 width=0) (actual time=126512.031..136806.093 rows=316558 loops=1)
        Hash Cond: (response_answer.response_id = r.id)
        ->  Seq Scan on response_answer  (cost=0.00..480365.73 rows=7667473 width=4) (actual time=1.443..70216.817 rows=7667473 loops=1)
        ->  Hash  (cost=8053.35..8053.35 rows=803 width=4) (actual time=173.467..173.476 rows=7010 loops=1)
              Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 311kB
              ->  Seq Scan on response r  (cost=0.00..8053.35 rows=803 width=4) (actual time=0.489..107.417 rows=7010 loops=1)
                    Filter: ((date_recorded >= '2019-08-01'::date) AND (date_recorded <= '2019-08-29'::date))
                    Rows Removed by Filter: 153682
Planning time: 21.310 ms
Execution time: 139373.365 ms

I do have indexes on response_answer(response_id), response_answer(id), and response(id).

As the system grows, this query will eventually become unusably slow, because the cost of the sequential scan grows with the size of the table.

When dealing with large amounts of data, how should I design my queries and tables so that the database doesn't have to run a sequential scan of every. single. row? Surely there's a way for Postgres to only consider responses in the date range before finding all the related response_answers?

1 Answer


You need indexes on

response (date_recorded, id)

and

response_answer (response_id)

VACUUM the tables so that PostgreSQL can use an index-only scan.
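
A minimal sketch of those statements, assuming the schema from the question (the response_answer (response_id) index already exists per the question, so only the composite index is new):

-- two-column index: both the date filter and the join key
-- can be answered from the index alone
create index on response (date_recorded, id);

-- refresh the visibility map so index-only scans are possible
vacuum analyze response;
vacuum analyze response_answer;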

With a query like this, you don't need an outer join. PostgreSQL is smart enough to infer that: the WHERE condition can never be true for the NULL-extended rows an outer join would add, so it reduces the join to an inner join.
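
For example, a sketch of the same query with a plain join (date literals shortened; adjust them if date_recorded is a timestamp rather than a date):

explain analyse
  select count(*)
    from response_answer
    join response r on r.id = response_answer.response_id
    where r.date_recorded between '2019-08-01' and '2019-08-29';

With the indexes above, the planner can scan response by date first and then probe response_answer via the response_id index, instead of scanning every row.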


3 Comments

Okay, I'll try vacuuming and adding the missing index. Do you mean to just use a regular join as opposed to an outer join?
Yes. If r.id = response_answer.response_id is true, r.id cannot be NULL.
Wow, that did it. After I vacuumed both tables and added the 2 column index it brought a 28 second query down to a 2.6 second query. Thanks Laurenz!
