I have the following table structure:

create table transfers
(
    id serial not null
        constraint transactions_pkey
            primary key,
    name varchar(255) not null,
    money integer not null
);

create index transfers_name_index
    on transfers (name);

The following query is quite slow because it does a sequential scan:

EXPLAIN ANALYZE SELECT name
FROM transfers
GROUP BY name
ORDER BY name ASC;

Group  (cost=37860.49..41388.54 rows=14802 width=15) (actual time=4285.530..7459.872 rows=999766 loops=1)
  Group Key: name
  ->  Gather Merge  (cost=37860.49..41314.53 rows=29604 width=15) (actual time=4285.529..7136.432 rows=999935 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=36860.46..36897.47 rows=14802 width=15) (actual time=4104.159..5107.148 rows=333312 loops=3)
              Sort Key: name
              Sort Method: external merge  Disk: 14928kB
              Worker 0:  Sort Method: external merge  Disk: 13616kB
              Worker 1:  Sort Method: external merge  Disk: 13656kB
              ->  Partial HashAggregate  (cost=35687.15..35835.17 rows=14802 width=15) (actual time=604.984..689.111 rows=333312 loops=3)
                    Group Key: name
                    ->  Parallel Seq Scan on transfers  (cost=0.00..32571.52 rows=1246252 width=15) (actual time=0.063..200.548 rows=997032 loops=3)
Planning Time: 0.088 ms
Execution Time: 7531.142 ms

However, with enable_seqscan set to off, the index-only scan is used, as I would expect.

SET enable_seqscan = OFF;

EXPLAIN ANALYZE SELECT name
FROM transfers
GROUP BY name
ORDER BY name ASC;

Group  (cost=1000.45..100492.67 rows=14802 width=15) (actual time=8.032..2212.538 rows=999766 loops=1)
  Group Key: name
  ->  Gather Merge  (cost=1000.45..100418.66 rows=29604 width=15) (actual time=8.029..1880.388 rows=999778 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Group  (cost=0.43..96001.60 rows=14802 width=15) (actual time=0.074..383.471 rows=333259 loops=3)
              Group Key: name
              ->  Parallel Index Only Scan using transfers_name_index on transfers  (cost=0.43..92885.97 rows=1246252 width=15) (actual time=0.066..189.436 rows=997032 loops=3)
                    Heap Fetches: 0
Planning Time: 0.197 ms
Execution Time: 2279.321 ms

Why does Postgres not use the more efficient index-only scan without being forced to? The table contains about 3 million records. I am using PostgreSQL 11.2.

  • @a_horse_with_no_name I already tried that; it doesn't seem to make a difference. Version added to the opening post. Commented Nov 12, 2019 at 17:27
  • Your query wants all the records. It would need all the records for an index-only scan, too. (But maybe the row size could differ?) Commented Nov 12, 2019 at 17:31
  • Maybe your random_page_cost is set too high. (The factory default is 4.0; for SSD / NAS you can lower it to below 2.) Commented Nov 12, 2019 at 18:10
  • There are several query planner parameters. As far as I know, for modern SSD devices random_page_cost should be 2. Note that it can be set at runtime, so just before your query, execute set random_page_cost to 2; Commented Nov 12, 2019 at 18:15
  • "Group (cost=37860.49..41388.54 rows=14802 width=15) (actual time=4285.530..7459.872 rows=999766 loops=1)" Do you know why this estimate is so wrong? Commented Nov 12, 2019 at 21:17

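The random_page_cost suggestion from the comments can be tried for a single session without touching the server configuration. A minimal sketch, using the value 2 proposed in the comments (whether that value suits the actual storage is an assumption):

-- Show the current value (the built-in default is 4.0)
SHOW random_page_cost;

-- Lower it for this session only; 2 is the value suggested in the comments
SET random_page_cost = 2;

EXPLAIN ANALYZE SELECT name
FROM transfers
GROUP BY name
ORDER BY name ASC;

-- Go back to the configured value for the rest of the session
RESET random_page_cost;

If the plan switches to the index-only scan, the cost settings are at least part of the explanation. If it does not, the row estimates are the next thing to look at.
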
3 Answers

For Postgres to prefer the index-only scan, most of the table's pages should be marked all-visible in the visibility map. You can check this in pg_class:

SELECT relpages, relallvisible FROM pg_class WHERE relname='transfers';

If relallvisible is 0 or much lower than relpages, you should VACUUM the table:

VACUUM ANALYZE transfers;
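
To make the comparison easier to read, the two counters can be turned into a percentage. This is only a sketch; pct_all_visible is an alias introduced here for illustration, and both counters are estimates that are refreshed by VACUUM and ANALYZE:

-- Share of heap pages marked all-visible; nullif avoids division by zero on an empty table
SELECT relpages,
       relallvisible,
       round(100.0 * relallvisible / nullif(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'transfers';

A value close to 100 suggests that visibility is not what is steering the planner away from the index-only scan.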

3 Comments

Thank you, I already tried vacuuming. Executing the query returns 20109 for both columns. But since we are only selecting data that is in an index (the name column), we are not actually accessing the heap, so is visibility really a concern here?
Yes, it's definitely a concern. If the pages are not marked all-visible, Postgres will need to access the heap to check visibility. There is some discussion about that here: postgresql.org/docs/current/indexes-index-only-scans.html
Thanks. But since relpages equals relallvisible, this doesn't seem to be the issue here, right?

Try adding a decent amount of data and running the queries again. Postgres doesn't always use the index and may decide that a sequential scan is quicker if there are only a few records in the table.

1 Comment

Great, I have seen indexes ignored when there are only a few rows.

When I fill your table with 3e6 rows containing 1e6 distinct names, I get the index only scan. However, if I force the distinct value estimate to match yours, it switches to the seq scan:

alter table transfers alter name set (N_DISTINCT = 14802);
analyze transfers;

So if you use the same method to set it to the correct value, I bet yours would switch the other way.
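
One way to apply that suggestion, sketched under the assumption that the true number of distinct names is close to the 999766 groups reported in the plans in the question:

-- What the planner currently believes about the column
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'transfers' AND attname = 'name';

-- The actual number of distinct names (this scans the table, so it is not free)
SELECT count(DISTINCT name) FROM transfers;

-- Override the estimate; 999766 is taken from the plan output, substitute the measured count
ALTER TABLE transfers ALTER COLUMN name SET (n_distinct = 999766);

-- The override only takes effect at the next ANALYZE
ANALYZE transfers;

A negative value between -1 and 0 is interpreted as a fraction of the row count instead, which keeps the estimate reasonable as the table grows. The override can be dropped again with ALTER TABLE transfers ALTER COLUMN name RESET (n_distinct); followed by another ANALYZE.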

Why is it wrong in the first place? I bet your table is clustered on name, and your default_statistics_target is too low.
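
If the statistics target is the culprit, it can be raised for just this column rather than globally. A sketch, assuming 1000 as a trial value (the default is 100 and the maximum is 10000):

-- Current global default
SHOW default_statistics_target;

-- Sample this column more thoroughly; 1000 is an assumed trial value
ALTER TABLE transfers ALTER COLUMN name SET STATISTICS 1000;
ANALYZE transfers;

-- Re-check the estimate, and how closely the physical row order follows name
SELECT n_distinct, correlation
FROM pg_stats
WHERE tablename = 'transfers' AND attname = 'name';

A correlation close to 1 or -1 would support the guess that the table is physically ordered by name.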

1 Comment

This seems like the most likely cause at this point. There are a few names that appear across a lot of records, and also a lot of names that appear in only one record. I'll see if I can play with the table statistics myself.
