I have a table with several million rows called item with columns that look like this:
CREATE TABLE item (
id bigint NOT NULL,
company_id bigint NOT NULL,
date_created timestamp with time zone,
....
)
There is an index on company_id:
CREATE INDEX idx_company_id ON item USING btree (company_id);
This table is often searched for the last 10 items for a certain customer, e.g.,
SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;
Currently, one customer accounts for about 75% of the data in that table; the other 25% is spread across 25 or so other customers. In other words, 75% of the rows have a company id of 5, and the remaining rows have company ids between 6 and 25.
The query generally runs very fast for all companies except the predominant one (id = 5). I can understand why: the index on company_id is selective enough to be useful for every company except 5.
I have experimented with different indexes to make this search more efficient for company 5. The one that seemed to make the most sense is
CREATE INDEX idx_date_created
ON item (date_created DESC NULLS LAST);
If I add this index, queries for the predominant company (id = 5) are greatly improved, but queries for all other companies become drastically slower.
Some results of EXPLAIN ANALYZE for company ids 5 and 6, with and without the new index (in the plans, the table appears as photo and the date column as photo_created):
Company Id 5
Before new index
QUERY PLAN
Limit (cost=214874.63..214874.65 rows=10 width=639) (actual time=10481.989..10482.017 rows=10 loops=1)
-> Sort (cost=214874.63..218560.33 rows=1474282 width=639) (actual time=10481.985..10481.994 rows=10 loops=1)
Sort Key: photo_created
Sort Method: top-N heapsort Memory: 35kB
-> Seq Scan on photo (cost=0.00..183015.92 rows=1474282 width=639) (actual time=0.009..5345.551 rows=1473561 loops=1)
Filter: (company_id = 5)
Rows Removed by Filter: 402513
Total runtime: 10482.075 ms
After new index:
QUERY PLAN
Limit (cost=0.43..1.98 rows=10 width=639) (actual time=0.087..0.120 rows=10 loops=1)
-> Index Scan using idx_photo__photo_created on photo (cost=0.43..228408.04 rows=1474282 width=639) (actual time=0.084..0.099 rows=10 loops=1)
Filter: (company_id = 5)
Rows Removed by Filter: 26
Total runtime: 0.164 ms
Company Id 6
Before new index:
QUERY PLAN
Limit (cost=2204.27..2204.30 rows=10 width=639) (actual time=0.044..0.053 rows=3 loops=1)
-> Sort (cost=2204.27..2207.55 rows=1310 width=639) (actual time=0.040..0.044 rows=3 loops=1)
Sort Key: photo_created
Sort Method: quicksort Memory: 28kB
-> Index Scan using idx_photo__company_id on photo (cost=0.43..2175.96 rows=1310 width=639) (actual time=0.020..0.026 rows=3 loops=1)
Index Cond: (company_id = 6)
Total runtime: 0.100 ms
After new index:
QUERY PLAN
Limit (cost=0.43..1744.00 rows=10 width=639) (actual time=0.039..3938.986 rows=3 loops=1)
-> Index Scan using idx_photo__photo_created on photo (cost=0.43..228408.04 rows=1310 width=639) (actual time=0.035..3938.975 rows=3 loops=1)
Filter: (company_id = 6)
Rows Removed by Filter: 1876071
Total runtime: 3939.028 ms
I have run a full VACUUM and ANALYZE on the table, so PostgreSQL should have up-to-date statistics. Any ideas how I can get PostgreSQL to choose the right index for the company being queried?
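For reference, the per-column statistics that ANALYZE collected can be read from the standard pg_stats view. The query below is only a sketch: it assumes the table is named item as in the simplified definition above (the plans show photo, so substitute the real name). If company 5 appears in most_common_vals with a frequency near 0.75, the planner already knows about the skew.

```sql
-- Sketch: show what the planner knows about the distribution of company_id.
-- 'item' is the simplified table name used in this question; adjust as needed.
SELECT most_common_vals,   -- most frequent company_id values
       most_common_freqs,  -- fraction of rows for each of those values
       n_distinct          -- estimated number of distinct company_id values
FROM   pg_stats
WHERE  tablename = 'item'
AND    attname   = 'company_id';
```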
LIMIT is cheating. But it will be clearer if you provide EXPLAIN with ANALYZE; it will help us inspect the table statistics used by the planner. BTW, are you running regular VACUUM ANALYZE?
... company_ids are there? What percentage of the table is company_id = 5?
CREATE INDEX idx ON item (company_id, date_created);
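A possible variant of the composite index suggested in the last comment, written as a sketch rather than a tested fix. It assumes that "last 10 items" means newest first, so the index and the query both use DESC NULLS LAST, the same ordering as idx_date_created in the question. The index name idx_item_company_date is hypothetical. With company_id as the leading column, the 10 newest rows for one company sit next to each other in the index, so the same index could serve both the dominant company and the small ones.

```sql
-- Sketch: equality column first, then the sort column in the query's order.
-- The index name is made up for illustration.
CREATE INDEX idx_item_company_date
    ON item (company_id, date_created DESC NULLS LAST);

-- The ORDER BY must match the index ordering (including NULLS LAST)
-- for the planner to read the rows pre-sorted from the index.
SELECT *
FROM   item
WHERE  company_id = 5
ORDER  BY date_created DESC NULLS LAST
LIMIT  10;
```

If the query really sorts ascending, as written in the question, the plain (company_id, date_created) index from the comment already matches that order.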