Why is Postgres not using index on a simple GROUP BY?

Question

I have created a 36M rows table with an index on type column:

CREATE TABLE items AS
  SELECT
    (random()*36000000)::integer AS id,
    (random()*10000)::integer AS type,
    md5(random()::text) AS s
  FROM
    generate_series(1,36000000);
CREATE INDEX items_type_idx ON items USING btree ("type");

I run this simple query and expect postgresql to use my index:

explain select count(*) from "items" group by "type";

But the query planner decides to use Seq Scan instead:

HashAggregate  (cost=734592.00..734627.90 rows=3590 width=12) (actual time=6477.913..6478.344 rows=3601 loops=1)
  Group Key: type
  ->  Seq Scan on items  (cost=0.00..554593.00 rows=35999800 width=4) (actual time=0.044..1820.522 rows=36000000 loops=1)
Planning time: 0.107 ms
Execution time: 6478.525 ms

Time without EXPLAIN: 5s 979ms

I have tried several solutions from here and here:

Run VACUUM ANALYZE or VACUUM ANALYZE
Configure default_statistics_target, random_page_cost, work_mem

but nothing helps apart from setting enable_seqscan = OFF:

SET enable_seqscan = OFF;
explain select count(*) from "items" group by "type";

GroupAggregate  (cost=0.56..1114880.46 rows=3590 width=12) (actual time=5.637..5256.406 rows=3601 loops=1)
  Group Key: type
  ->  Index Only Scan using items_type_idx on items  (cost=0.56..934845.56 rows=35999800 width=4) (actual time=0.074..2783.896 rows=36000000 loops=1)
        Heap Fetches: 0
Planning time: 0.103 ms
Execution time: 5256.667 ms

Time without EXPLAIN: 659ms

Query with index scan is about 10x faster on my machine.

Is there a better solution than setting enable_seqscan?

UPD1

My postgresql version is 9.6.3, work_mem = 4MB (tried 64MB), random_page_cost = 4 (tried 1.1), max_parallel_workers_per_gather = 0 (tried 4).

UPD2

I have tried to fill type column not with random numbers, but with i / 10000 to make pg_stats.correlation = 1 - still seqscan.

UPD3

@jgh is 100% right:

This typically only happens when the table's row width is much wider than some indexes

I've made large column data and now postgres use index. Thanks everyone!

What's your PostgreSQL version? Also, please provide output of EXPLAIN ANALYZE. — Michel Milezzi
– Michel Milezzi, Commented Jul 6, 2017 at 18:10
dbfiddle.uk/… - just compare actual timing for enable_seqscan = ON and enable_seqscan = OFF — Abelisto
– Abelisto, Commented Jul 6, 2017 at 18:18
What are your settings for work_mem and random_page_cost? [and : why doesn't the table have a Primary Key?] — wildplasser
– wildplasser, Commented Jul 6, 2017 at 18:41
Note:I have work_mem very low (4M) and random_page_cost= 1.1 (I have an SSD), and I get an index scan without fiddling the switches. — wildplasser
– wildplasser, Commented Jul 6, 2017 at 19:03
I think this index can be meaningfull on requests like select "type", count(*) from "items" group by "type"; Why to use group by to count all records? — JustAnotherCoder
– JustAnotherCoder, Commented Mar 10, 2021 at 11:58

JGH · Accepted Answer · 2017-07-06 18:52:19Z

11

The Index-only scans wiki says

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.

and

Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle

Accordingly, your index is not considered "significantly smaller" and the entire dataset is to be read, which leads the planner in using a seq scan

answered Jul 6, 2017 at 18:52

JGH

18.2k5 gold badges40 silver badges57 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Austin A Over a year ago

This is good to know, but still a little confusing. I have a table that is 56GB and a column index that is 1.5GB. When I run a group by using only the indexed column, the query planner still bypasses the index and uses a seq scan.

Collectives™ on Stack Overflow

Why is Postgres not using index on a simple GROUP BY?

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related