
I have a table with several million rows called item with columns that look like this:

CREATE TABLE item (
  id bigint NOT NULL,
  company_id bigint NOT NULL,
  date_created timestamp with time zone,
  ....
)

There is an index on company_id

CREATE INDEX idx_company_id ON item USING btree (company_id);

This table is often searched for the last 10 items for a certain customer, i.e.,

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;

Currently, one customer accounts for about 75% of the data in that table; the other 25% is spread across 25 or so other customers. In other words, 75% of the rows have a company_id of 5, and the remaining rows have company_ids between 6 and 25.

The query generally runs very fast for all companies except the predominant one (id = 5). I can understand why: the index on company_id is selective for every company except 5, whose rows make up most of the table.

I have experimented with different indexes to make this search more efficient for company 5. The one that seemed to make the most sense is

CREATE INDEX idx_date_created
ON item (date_created DESC NULLS LAST);

If I add this index, queries for the predominant company (id = 5) are greatly improved, but queries for all other companies go to crap.

Some EXPLAIN ANALYZE results for company ids 5 and 6, with and without the new index:

Company Id 5

Before new index

QUERY PLAN
Limit  (cost=214874.63..214874.65 rows=10 width=639) (actual time=10481.989..10482.017 rows=10 loops=1)
  ->  Sort  (cost=214874.63..218560.33 rows=1474282 width=639) (actual time=10481.985..10481.994 rows=10 loops=1)
        Sort Key: photo_created
        Sort Method: top-N heapsort  Memory: 35kB
        ->  Seq Scan on photo  (cost=0.00..183015.92 rows=1474282 width=639) (actual time=0.009..5345.551 rows=1473561 loops=1)
              Filter: (company_id = 5)
              Rows Removed by Filter: 402513
Total runtime: 10482.075 ms

After new index:

QUERY PLAN
Limit  (cost=0.43..1.98 rows=10 width=639) (actual time=0.087..0.120 rows=10 loops=1)
  ->  Index Scan using idx_photo__photo_created on photo  (cost=0.43..228408.04 rows=1474282 width=639) (actual time=0.084..0.099 rows=10 loops=1)
        Filter: (company_id = 5)
        Rows Removed by Filter: 26
Total runtime: 0.164 ms

Company Id 6

Before new index:

QUERY PLAN
Limit  (cost=2204.27..2204.30 rows=10 width=639) (actual time=0.044..0.053 rows=3 loops=1)
  ->  Sort  (cost=2204.27..2207.55 rows=1310 width=639) (actual time=0.040..0.044 rows=3 loops=1)
        Sort Key: photo_created
        Sort Method: quicksort  Memory: 28kB
        ->  Index Scan using idx_photo__company_id on photo  (cost=0.43..2175.96 rows=1310 width=639) (actual time=0.020..0.026 rows=3 loops=1)
              Index Cond: (company_id = 6)
Total runtime: 0.100 ms

After new index:

QUERY PLAN
Limit  (cost=0.43..1744.00 rows=10 width=639) (actual time=0.039..3938.986 rows=3 loops=1)
  ->  Index Scan using idx_photo__photo_created on photo  (cost=0.43..228408.04 rows=1310 width=639) (actual time=0.035..3938.975 rows=3 loops=1)
        Filter: (company_id = 6)
        Rows Removed by Filter: 1876071
Total runtime: 3939.028 ms

I have run a full VACUUM and ANALYZE on the table, so PostgreSQL should have up-to-date statistics. Any ideas how I can get PostgreSQL to choose the right index for the company being queried?

  • My guess is that LIMIT is cheating, but it will be clearer if you provide EXPLAIN ANALYZE output; that will help us inspect the table statistics used by the planner. BTW, are you running regular VACUUM ANALYZE? Commented Jun 22, 2017 at 20:37
  • How many distinct company_ids are there? What percentage of the table is company_id = 5? Commented Jun 22, 2017 at 20:38
  • Edited my post to add more detail, thanks for your help Commented Jun 22, 2017 at 21:42
  • what happens when you create an index over both the columns? CREATE INDEX idx ON item (company_id, date_created); Commented Jun 22, 2017 at 21:52
  • If I create an index on both columns, it is not used for any queries (either 5 or 6). Commented Jun 22, 2017 at 21:57

2 Answers


This is known as the "abort-early plan problem", and it's been a chronic mis-optimization for years. Abort-early plans are amazing when they work, but terrible when they don't; see that linked mailing list thread for a more detailed explanation. Basically, the planner thinks it'll find the 10 rows you want for customer 6 without scanning the whole date_created index, and it's wrong.

There isn't any hard-and-fast way to improve this query categorically prior to PostgreSQL 10 (now in beta). What you'll want to do is nudge the query planner in various ways in hopes of getting what you want. The primary methods are anything which makes PostgreSQL more likely to use a multi-column index that covers both the filter and the sort; a sketch of such an index follows.
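
For example, the composite index already suggested in the comments above and in the second answer below (the DESC NULLS LAST ordering assumes the query sorts that way too):

-- Covers both the equality filter and the sort order of the query
CREATE INDEX idx_company_id_date_created
ON item (company_id, date_created DESC NULLS LAST);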

It's also possible that you may be able to fix the planner behavior by playing with the table statistics. This includes:

  • raising the statistics target for the relevant columns and running ANALYZE again, so that PostgreSQL takes more samples and gets a better picture of the row distribution;
  • overriding n_distinct in the stats so that it accurately reflects the number of distinct company_ids or date_created values (a sketch of both tweaks follows this list).
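
A minimal sketch of both tweaks, assuming the table and column names from the question (the target of 1000 and the n_distinct of 26 are placeholder values, not recommendations):

-- Sample more rows per column on the next ANALYZE
ALTER TABLE item ALTER COLUMN company_id SET STATISTICS 1000;
ALTER TABLE item ALTER COLUMN date_created SET STATISTICS 1000;

-- Tell the planner how many distinct companies there really are
ALTER TABLE item ALTER COLUMN company_id SET (n_distinct = 26);

-- Refresh statistics so the new settings take effect
ANALYZE item;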

However, all of these solutions are approximate, and if query performance goes to heck as your data changes in the future, this should be the first query you look at.

In PostgreSQL 10, you'll be able to create cross-column statistics, which should improve the situation more reliably. Depending on how broken this is for you, you could try using the beta.
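
For reference, a sketch of what that looks like with PostgreSQL 10's CREATE STATISTICS (the statistics object name is made up; ndistinct and dependencies are the kinds of extended statistics available in 10):

-- Extended (cross-column) statistics on the two correlated columns
CREATE STATISTICS item_company_date_stats (ndistinct, dependencies)
  ON company_id, date_created FROM item;

-- Rebuild statistics so the planner can use them
ANALYZE item;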

If none of that works, I suggest the #postgresql IRC channel on Freenode or the pgsql-performance mailing list. Folks there will ask for your detailed table stats in order to make some suggestions.


1 Comment

Thanks for the explanation; bummer that it appears to be a "bug" in PostgreSQL. I can address this specific situation with a partial index. Kind of a kludge, but for now it will buy me time.
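
A partial index along those lines might look like this (a sketch using the names from the question; the index name is made up):

-- Index only the dominant company's rows, ordered the way the query sorts them
CREATE INDEX idx_item_company5_date_created
ON item (date_created DESC NULLS LAST)
WHERE company_id = 5;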

Yet another point: why do you create the index

CREATE INDEX idx_date_created ON item (date_created DESC NULLS LAST);

but then run:

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;

Maybe you meant:

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created DESC NULLS LAST LIMIT 10;

It is also better to create a combined index:

CREATE INDEX idx_company_id_date_created ON item (company_id, date_created DESC NULLS LAST);

With that index in place, the plans for both companies look like this:

                                                                     QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..28.11 rows=10 width=16) (actual time=0.120..0.153 rows=10 loops=1)
   ->  Index Only Scan using idx_company_id_date_created on item  (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.118..0.145 rows=10 loops=1)
         Index Cond: (company_id = 5)
         Heap Fetches: 10
 Planning time: 1.003 ms
 Execution time: 0.209 ms
(6 rows)
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..28.11 rows=10 width=16) (actual time=0.085..0.115 rows=10 loops=1)
   ->  Index Only Scan using idx_company_id_date_created on item  (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.084..0.108 rows=10 loops=1)
         Index Cond: (company_id = 6)
         Heap Fetches: 10
 Planning time: 0.136 ms
 Execution time: 0.155 ms
(6 rows)

On your server it might be slower, but in any case it should be much better than in the examples above.

