
I'm working with a relatively large Postgres database containing several terabytes of data, going back five years. As a result, the indexes keep growing, which degrades query performance and consumes a lot of storage. The current index definition (original_index) is as follows:

CREATE INDEX original_index ON table1
   USING btree (key10 DESC)

I'm trying to restrict the index to only the last year, and the index definition (constrained_index) for that is as follows (I dropped original_index before creating constrained_index):

CREATE INDEX constrained_index ON table1 
   USING btree (key10 DESC) 
   WHERE (key10 >= '2023-06-01 00:00:00+00'::timestamp with time zone)

I used the same query with both original_index and constrained_index to compare execution times before and after.

When I have only original_index, the query planner uses it. But when I have only constrained_index, the planner does not use it; instead it performs a sequential scan on the table.

After creating constrained_index I ran vacuum verbose analyze table1; to update the statistics.

It would be highly appreciated if someone could explain why it is not selecting constrained_index.

Postgres Version: 12.5

Table definitions:

CREATE TABLE table1 (
    key1 integer,
    key2 character varying(128),
    key3 character varying(64),
    key4 character varying(128),
    key5 character varying(10),
    key6 timestamp with time zone,
    key7 timestamp with time zone,
    key8 integer NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    key9 boolean NOT NULL,
    key10 timestamp with time zone
);

CREATE TABLE table2 (
    key2 integer NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    key3 timestamp with time zone NOT NULL,
    key1 integer
);

CREATE TABLE table3 (
    key1 character varying(128) NOT NULL,
    key2 text
);

CREATE TABLE table4 (
    key1 character varying(64) NOT NULL,
    key2 text
);

Query:

SELECT table1.key1 AS a,
       table1.key2 AS b,
       table1.key3 AS c,
       table1.key4 AS d,
       table2.key1 AS e,
       COUNT(DISTINCT table1.key8) FILTER (WHERE table1.key5 = 'A') AS f,
       COUNT(DISTINCT table1.key8) FILTER (WHERE table1.key5 = 'B') AS g,
       COUNT(DISTINCT table1.key8) FILTER (WHERE table1.key5 = 'C') AS h,
       COUNT(DISTINCT table1.key8) FILTER (WHERE table1.key5 = 'D') AS i,
       COUNT(DISTINCT table1.key8) FILTER (WHERE table1.key5 IS NULL) AS j,
       SUM((table1.key6 - table1.key7)) AS k,
       COUNT(DISTINCT table1.key8) FILTER (WHERE table1.key9) AS l,
       COUNT(DISTINCT table1.key8) AS m
FROM table1
LEFT OUTER JOIN table3 ON (table1.key4 = table3.key1)
LEFT OUTER JOIN table2 ON (table1.key1 = table2.key2)
LEFT OUTER JOIN table4 ON (table1.key3 = table4.key1)
WHERE table1.key10 BETWEEN '2023-10-31 23:00:00+00:00' AND '2023-11-02 23:00:00+00:00'
GROUP BY 1,
         2,
         3,
         4,
         5,
         table2.key3,
         table4.key2,
         table3.key2
ORDER BY table2.key3 DESC,
         table1.key2 ASC,
         table4.key2 DESC,
         table3.key2 DESC,
         table2.key1 ASC
LIMIT 20

Output of EXPLAIN ANALYZE:

Limit  (cost=18341692.59..18341693.89 rows=20 width=159) (actual time=32149.681..32192.245 rows=20 loops=1)
  ->  GroupAggregate  (cost=18341692.59..18353658.57 rows=184092 width=159) (actual time=31749.816..31792.379 rows=20 loops=1)
        Group Key: table2.key3, table1.key2, table4.key2, table3.key2, table2.key1, table1.key1, table1.key3, table1.key4
        ->  Sort  (cost=18341692.59..18342152.82 rows=184092 width=115) (actual time=31749.763..31792.253 rows=80 loops=1)
              Sort Key: table2.key3 DESC, table1.key2, table4.key2 DESC, table3.key2 DESC, table2.key1, table1.key1, table1.key3, table1.key4
              Sort Method: external merge  Disk: 23424kB
              ->  Gather  (cost=83595.20..18314267.69 rows=184092 width=115) (actual time=31417.800..31681.660 rows=186540 loops=1)
                    Workers Planned: 2
                    Workers Launched: 2
                    ->  Hash Left Join  (cost=82595.20..18294858.49 rows=76705 width=115) (actual time=31392.569..31545.627 rows=62180 loops=3)
                          Hash Cond: ((table1.key4)::text = (table3.key1)::text)
                          ->  Hash Left Join  (cost=81113.79..18293175.70 rows=76705 width=90) (actual time=31379.926..31524.528 rows=62180 loops=3)
                                Hash Cond: ((table1.key3)::text = (table4.key1)::text)
                                ->  Parallel Hash Left Join  (cost=78481.53..18288053.08 rows=76705 width=72) (actual time=31361.889..31481.346 rows=62180 loops=3)
                                      Hash Cond: (table1.key1 = table2.key2)
                                      ->  Parallel Seq Scan on table1  (cost=0.00..18200395.20 rows=76705 width=60) (actual time=2.888..30824.027 rows=62180 loops=3)
                                            Filter: ((key10 >= '2023-10-31 23:00:00+00'::timestamp with time zone) AND (key10 <= '2023-11-02 23:00:00+00'::timestamp with time zone))
                                            Rows Removed by Filter: 43882417
                                      ->  Parallel Hash  (cost=52397.57..52397.57 rows=1500557 width=16) (actual time=497.607..497.608 rows=1200445 loops=3)
                                            Buckets: 131072  Batches: 64  Memory Usage: 3712kB
                                            ->  Parallel Seq Scan on table2  (cost=0.00..52397.57 rows=1500557 width=16) (actual time=225.666..349.851 rows=1200445 loops=3)
                                ->  Hash  (cost=1244.45..1244.45 rows=71745 width=25) (actual time=17.720..17.721 rows=72031 loops=3)
                                      Buckets: 65536  Batches: 2  Memory Usage: 2568kB
                                      ->  Seq Scan on table4  (cost=0.00..1244.45 rows=71745 width=25) (actual time=0.019..6.313 rows=72031 loops=3)
                          ->  Hash  (cost=893.96..893.96 rows=46996 width=40) (actual time=12.367..12.368 rows=47030 loops=3)
                                Buckets: 65536  Batches: 1  Memory Usage: 3865kB
                                ->  Seq Scan on table3  (cost=0.00..893.96 rows=46996 width=40) (actual time=0.017..4.731 rows=47030 loops=3)
Planning Time: 2.249 ms
JIT:
  Functions: 90
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 8.216 ms, Inlining 84.266 ms, Optimization 596.840 ms, Emission 394.586 ms, Total 1083.909 ms
Execution Time: 32199.853 ms
  • We'd need to see `EXPLAIN (ANALYZE, BUFFERS, SETTINGS)` output for the queries. Also, you shouldn't make a secret of your PostgreSQL version. Commented Apr 10, 2024 at 14:11
  • Best guess: your query covers enough records that the planner thinks a sequential scan is faster than the index lookup. @LaurenzAlbe's suggestion would help answer that. It might also be helpful to add the table definitions to your question text. Commented Apr 10, 2024 at 15:24
  • @LaurenzAlbe updated with the information Commented Apr 10, 2024 at 15:33
  • I have no idea what <key1> is, but you need an index on it. Most likely <key2> as well, but that's also not clear. This (<key1>)::text = (<key2>)::text is also interesting; it looks like you do some casting to text, which could prevent index usage, since it looks like at least <key1> is a timestamp, not text. Next time share some information: the DDL for the tables and indexes, the query itself, and a complete query plan that matches the DDL. We can now only guess what is going on. Commented Apr 10, 2024 at 15:41
  • Your index definition still says WHERE (<key11> >= ...), which obviously does not match the condition on key10. Typo or not? And is the main table named "document" or "table1"? Please get your question straight. And cut the noise: "<key1>" has no benefit over just key1, etc. Commented Apr 25, 2024 at 6:08

1 Answer


It would be highly appreciated if someone could explain why it is not selecting constrained_index.

The Postgres documentation on partial indexes says, in Example 11.2:

However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example “x < 1” implies “x < 2”; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index. For example a prepared query with a parameter might specify “x < ?” which will never imply “x < 2” for all possible values of the parameter.

Your index has a simple >= condition. Your query uses BETWEEN. Yes, the query's range lies inside the index predicate, but as you have just observed, the Postgres planner is apparently not recognizing that the query's condition implies the predicate.
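Before experimenting with the filter itself, it can help to separate "the planner cannot use the index" from "the planner chooses not to". A common session-local trick (a sketch; safe to try and undo):

```sql
-- Heavily penalize sequential scans for this session only.
SET enable_seqscan = off;

EXPLAIN
SELECT count(*) FROM table1
WHERE key10 BETWEEN '2023-10-31 23:00:00+00' AND '2023-11-02 23:00:00+00';

RESET enable_seqscan;
```

If the plan still shows a Seq Scan even with enable_seqscan off, the partial index is genuinely unusable for this query; if it switches to an index scan, the sequential scan was merely estimated to be cheaper.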

Try running a query with a simple >= filter first, even without the second <= part and with exactly the same constant date as in the index definition, and see whether the planner uses the partial index. Then change the constant date in the query to a different date. Then add the second <= part.
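Concretely, that step-by-step experiment might look like this (a sketch, using the dates from your question):

```sql
-- Step 1: WHERE clause identical to the index predicate; the planner
-- should trivially recognize that it implies the predicate.
EXPLAIN
SELECT count(*) FROM table1
WHERE key10 >= '2023-06-01 00:00:00+00';

-- Step 2: a later constant; "key10 >= '2023-10-31 ...'" implies
-- "key10 >= '2023-06-01 ...'", a simple inequality implication the
-- planner can recognize.
EXPLAIN
SELECT count(*) FROM table1
WHERE key10 >= '2023-10-31 23:00:00+00';

-- Step 3: add the upper bound back, which is equivalent to the
-- original BETWEEN.
EXPLAIN
SELECT count(*) FROM table1
WHERE key10 >= '2023-10-31 23:00:00+00'
  AND key10 <= '2023-11-02 23:00:00+00';
```

Whichever step stops using the index tells you where the planner loses track of the implication.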

Also, double-check whether the last point about parameterized queries not working with partial indexes applies to your query.
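For instance, if the application sends the dates as bind parameters, a generic cached plan can never use the partial index, because "key10 >= $1" does not imply the index predicate for all possible parameter values. A sketch of how that can surface and, on v12, be worked around:

```sql
PREPARE q(timestamptz, timestamptz) AS
SELECT count(*) FROM table1 WHERE key10 BETWEEN $1 AND $2;

-- After several executions the server may switch to a cached generic
-- plan, which cannot use the partial index. Forcing custom plans
-- (available since v12) keeps the constants visible to the planner.
SET plan_cache_mode = force_custom_plan;
EXPLAIN EXECUTE q('2023-10-31 23:00:00+00', '2023-11-02 23:00:00+00');
```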

Final note: see Example 11.4 and its warning, "Do Not Use Partial Indexes as a Substitute for Partitioning."
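For a multi-terabyte table with time-based retention, that advice usually points to declarative range partitioning on the timestamp column instead. A rough sketch only (migrating an existing table means creating a new partitioned table and moving the data into it):

```sql
-- Hypothetical new parent table partitioned by the timestamp column.
CREATE TABLE table1_part (
    LIKE table1 INCLUDING ALL
) PARTITION BY RANGE (key10);

CREATE TABLE table1_2023 PARTITION OF table1_part
    FOR VALUES FROM ('2023-01-01 00:00:00+00') TO ('2024-01-01 00:00:00+00');

-- An index on the parent is created on every partition, and the planner
-- prunes partitions outside the query's date range automatically.
CREATE INDEX ON table1_part (key10 DESC);
```

Old partitions can then be detached or dropped outright, which also solves the storage-growth problem without index bloat.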


Also, in practice, partial indexes are useful when they drastically decrease the number of rows included in the index. 1 out of 5 years (20%) is not really significant. Even if you make the optimizer generate the same plan for the partial index as for the full index, I would not expect a noticeable difference in query performance. After all, the range of dates in the query is the same, so it has to read the same number of rows from the table anyway. Finding the starting point of the range in the b-tree is fast, O(log N), and you would not notice the difference when N becomes 0.2*N.


12 Comments

My apologies: when cleaning up and reformatting the question, I introduced a typo. Before catching and fixing that just now, it was harder to see (but still somewhat visible) in the query plan that PostgreSQL does in fact translate the BETWEEN to a >= and <= here: Filter: ((key10 >= '2023-10-31 23:00:00+00'::timestamp with time zone) AND (key10 <= '2023-11-02 23:00:00+00'::timestamp with time zone))
@Zegarek, yes, the engine converts BETWEEN into simple <= and >=, but you don't know when that happens: before the planner looks for suitable indexes or after, i.e. which representation of the query is used at which step of query processing. Try following the steps from simple to complex and you'll see at what stage the optimizer fails to recognize the partial index.
"you don't know when it happens" - You do: "The rule system is located between the parser and the planner" - but I think it might even take place earlier than the query rewriter, right in the parser, the same way != gets swapped out for a <> right there, on the spot. Also, a BETWEEN x AND y is equivalent to a >= x AND a <= y: "The various variants of BETWEEN are implemented in terms of the ordinary comparison operators"
What's the problem with trying these steps? "Try to run a query with a simple >= filter at first. Even without the second <= part and with exactly the same constant date as in the index definition and see if the planner would use the partial index. Then change the constant date in a query to a different date. Then add the second <= part." This is what I would have done to narrow it down.
I don't see a problem trying these, go right ahead. What I wanted to address is this: "Your index has a simple >= condition. Your query has BETWEEN. Yes, they logically overlap, but as you have just observed, Postgres planner is not smart enough to deduce that they are equivalent." I double-checked, and it turns out the swap does take place right in the parser, so both the rewriter and the planner/optimizer already see the query in that form.
