I have two columns: column1 with two distinct values (0, 1) and column2 with three distinct values ('A', 'B', 'C'). column1 is always 0 when column2 is 'A' or 'B', but when column2 is 'C', column1 has the following distribution: (0: 15%, 1: 85%). There is also a third column, departmentid. In some departments the distribution of column1 for column2 = 'C' is (0: 0%, 1: 100%). So in some cases, for a query like
SELECT * FROM mytable WHERE departmentid = 42 AND column2 = 'C' AND column1 = 0 ORDER BY id LIMIT 10;
PostgreSQL chooses an index scan on id and estimates ~25000 matching rows, but the table contains no rows matching this filter. The query therefore scans the whole table via the index, which takes far too long. If the planner chose a bitmap scan instead, it would be about 50× faster (based on other queries on this table). Indexes on all these columns are present.
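The skew can be seen directly with an aggregate query along these lines (an illustrative sketch, not part of the production workload):

```sql
-- Per-department distribution of column1 for column2 = 'C'.
-- For the problematic departments, no row with column1 = 0 shows up at all.
SELECT departmentid,
       column1,
       count(*) AS cnt,
       round(100.0 * count(*) / sum(count(*)) OVER (PARTITION BY departmentid), 2) AS pct
FROM mytable
WHERE column2 = 'C'
GROUP BY departmentid, column1
ORDER BY departmentid, column1;
```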
I have two questions:
- I created a statistics object on column1, column2 and departmentid, but the two dependencies I am interested in are missing: column2 => column1 and (column2, departmentid) => column1. Why? In the statistics object on column1 and column2, dependencies is NULL. I did run ANALYZE, of course.
- How can I speed up this query? Is there any way to do it without creating a dedicated 4-column index (departmentid, column2, column1, id)? The real production query has many other filters and different sort orders; the problem reproduces with this minimal query.
PostgreSQL 16.1, autovacuum and autoanalyze run daily, ~3 million rows in the table
update: table definition (the table has more than 100 columns and more than 50 indexes; only the relevant ones are shown):
CREATE TABLE mytable
(
id SERIAL
PRIMARY KEY,
column1 INTEGER DEFAULT 0 NOT NULL,
column2 TEXT NOT NULL,
departmentid INTEGER
);
CREATE INDEX mytable_departmentid_index
ON mytable (departmentid);
CREATE INDEX mytable_column1_index
ON mytable (column1);
CREATE INDEX mytable_column2_index
ON mytable (column2);
-- tested statistic
CREATE STATISTICS mytable_column2_column1 ON column2, column1 FROM mytable;
CREATE STATISTICS mytable_departmentid_column1_column2 ON departmentid, column1, column2 FROM mytable;
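After running ANALYZE, whether the dependencies were actually computed can be checked in the pg_stats_ext view (this is how I obtained the statistics output further down):

```sql
ANALYZE mytable;

-- Inspect the extended statistics actually stored for the table.
SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE tablename = 'mytable';
```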
departmentid has < 0.0002% NULLs; the statistics say there are none. The index on id exists because id is the primary key. All indexes are btree.
query plan (other columns are removed; the numbers are actual):
Limit (cost=0.68..501.79 rows=10 width=2690) (actual time=35351.049..47175.738 rows=1 loops=1)
Output: id, column1, column2, departmentid
Buffers: shared hit=1682274 read=1646793 dirtied=1640 written=980
I/O Timings: shared/local read=39193.882 write=9.565
WAL: records=1637 fpi=1637 bytes=3034199
-> Index Scan using mytable_pkey on public.mytable (cost=0.68..1392081.41 rows=27780 width=2690) (actual time=35351.048..47175.735 rows=1 loops=1)
Output: id, column1, column2, departmentid
Filter: ((mytable.departmentid = 42) AND (mytable.column2 = 'C'::text) AND (mytable.column1 = 0))
Rows Removed by Filter: 3536431
Buffers: shared hit=1682274 read=1646793 dirtied=1640 written=980
I/O Timings: shared/local read=39193.882 write=9.565
WAL: records=1637 fpi=1637 bytes=3034199
Settings: effective_cache_size = '8008368kB', effective_io_concurrency = '0', geqo_effort = '10', jit = 'off', random_page_cost = '1.2', search_path = 'public'
Planning Time: 0.503 ms
Execution Time: 47175.783 ms
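The cost numbers explain the choice: the planner believes 27780 rows match the filter, so to produce the first 10 rows it expects to read only a tiny prefix of the pkey index. The Limit node's cost is a linear fraction of the full index scan cost:

```sql
-- Planner's cost for the Limit node, derived from the index scan cost:
-- startup + (limit / estimated_rows) * (total_cost - startup)
SELECT 0.68 + (1392081.41 - 0.68) * 10 / 27780;  -- ≈ 501.79, matching the plan
```

With the true row count (1 row, not 27780), the same formula would give a cost near the full 1392081, and the index scan would never have been chosen.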
update 2: Question 1 contained my mistake: there is a (column2, departmentid) => column1 dependency, but no column2 => column1 dependency. I was also wrong that column1 is always 0 when column2 IN ('A', 'B'). The correct information is present in the statistics, but the percentage is small.
Why the 4-column index is not a good fit: the end user can choose which column to order by, so such an index would only cover the default case. Filtering by column1 and column2 is also just the default. So I would prefer a more general solution.
Question 3: Maybe there is some way to fix this by changing the DDL?
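For example, something along these lines (just a sketch of the kind of DDL change I mean; the index name and column choice are illustrative):

```sql
-- Partial index covering only the rare combination. It stays tiny,
-- and the planner's row estimate for it comes from the index predicate
-- rather than from the per-column statistics alone.
CREATE INDEX mytable_rare_c0_idx
    ON mytable (departmentid, id)
    WHERE column2 = 'C' AND column1 = 0;
```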
Question 4: Is there a way to compute dependencies between values with a query, similar to the corr() function?
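What I have in mind is something like the queries below (a sketch; corr() needs numeric inputs, so the text column has to be mapped to a number first):

```sql
-- Rough linear correlation between column1 and the condition column2 = 'C'.
SELECT corr(column1::double precision, (column2 = 'C')::int::double precision)
FROM mytable;

-- Or simply the conditional distribution itself, which is what the
-- (departmentid, column2) => column1 dependency summarizes.
SELECT column2, departmentid, column1, count(*)
FROM mytable
GROUP BY column2, departmentid, column1;
```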
Information from pg_stats_ext (all attribute numbers replaced with names):
| statistics_name | attnames | kinds | inherited | n_distinct | dependencies | most_common_vals | most_common_val_nulls | most_common_freqs | most_common_base_freqs |
|---|---|---|---|---|---|---|---|---|---|
| mytable_column2_column1 | {column1,column2} | {d,f,m} | false | {"column1, column2": 6} | <null> | {{1,C},{0,C},{0,B},{1,B},{1,A},{0,A}} | {{false,false},{false,false},{false,false},{false,false},{false,false},{false,false}} | {0.80046,0.13586,0.02913,0.02236,0.00660,0.00556} | {0.77662,0.15970,0.00878,0.04271,0.01009,0.00207} |
| mytable_departmentid_column1_column2 | {column1,departmentid,column2} | {d,f,m} | false | {"column1, departmentid": 1433, "column1, column2": 6, "departmentid, column2": 1492, "column1, departmentid, column2": 1784} | {"departmentid => column1": 0.137700, "departmentid => column2": 0.115367, "column1, departmentid => column2": 0.228133, "departmentid, column2 => column1": 0.226600} | {{1,44,C},{0,43,C},{1,42,C},{1,43,C}... | {{false,false,false},{false,false,false},{false,false,false},{false,false,false}... | {0.39790,0.06323,0.04683,0.02073... | {0.32481,0.01344,0.03789,0.06539... |
Information from pg_stats:
| attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | correlation | histogram_bounds |
|---|---|---|---|---|---|---|---|---|
| column1 | false | 0 | 4 | 2 | {1,0} | {0.82943,0.17056} | 0.64609 | <null> |
| departmentid | false | 0 | 4 | 1080 | {44,43,42,45... | {0.41823,0.08420,0.04879,0.017466... | 0.72736 | {57,72,147,200... |
| column2 | false | 0 | 16 | 3 | {C,B,A} | {0.93633,0.05150,0.01216} | 0.32751 | <null> |
Comments:
- Please add the EXPLAIN ANALYZE output for the query as a text update to the question.
- What about an index on (departmentid, column2, column1)? At least for your given query, that should be perfect. What do the other use cases look like? If you always query with a departmentid, this combined index should already be very useful. If you combine this query with joins, perhaps also include the id. Depending on use cases, you might also want to cluster the table by this index, but perhaps I am overshooting the question.
- I already have (departmentid, column2, column1, column4) and it is not used. I now created (departmentid, column2, column1) and it is also not used. I think it is because of the large number of expected rows (27780).
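One experiment that can at least confirm the diagnosis is comparing plans with a planner method disabled for the session (a debugging sketch, not a production fix):

```sql
-- Session-local experiment: see which plan (and which estimate)
-- PostgreSQL falls back to when the plain index scan is unavailable.
BEGIN;
SET LOCAL enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM mytable
WHERE departmentid = 42 AND column2 = 'C' AND column1 = 0
ORDER BY id LIMIT 10;
ROLLBACK;
```

If the resulting bitmap scan is indeed ~50× faster, the problem is purely the row estimate, not the physical index layout.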