We have queries of the form (table and column names are placeholders):

select sum(acol)
from sometable
where xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol)

What index can be built to speed up the WHERE clause?

A btree index created with

create index idx_01 on sometable using btree (xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol))

does not seem to be used at all.
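
For reference, a minimal self-contained sketch of the setup (sometable and its columns are placeholders, not the real schema):

-- Placeholder table standing in for the real one
create table sometable (
    acol   numeric,
    xmlcol xml
);

-- Expression index on the boolean result of xpath_exists()
create index idx_01 on sometable
    using btree (xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol));

-- The query the index is meant to serve; the plan shows whether the index is picked
explain analyze
select sum(acol)
from sometable
where xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol);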

EDIT

With enable_seqscan set to off, the query using xpath_exists is much faster (by an order of magnitude), and the plan clearly shows that the corresponding index (the btree index built on xpath_exists) is being used.

Any clue why PostgreSQL would not use the index and instead attempts a much slower sequential scan?

Since I do not want to disable sequential scans globally, I am back to square one and happily welcome suggestions.
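
For what it is worth, the workaround does not have to be global; a sketch of scoping it to a single transaction (sometable is the same placeholder as above, and this remains a workaround rather than a fix):

begin;
set local enable_seqscan = off;  -- reverts automatically at commit or rollback
select sum(acol)
from sometable
where xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol);
commit;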

EDIT 2 - Explain plans

See below. The cost of the first plan (seqscan off) is slightly higher, but the processing time is much faster.

b2box=# set enable_seqscan=off;
SET
b2box=# explain analyze
Select count(*) 
from B2HEAD.item
where cluster = 'B2BOX' and (  ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) )  )  offset 0 limit 1;
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.042..606.042 rows=1 loops=1)
   ->  Aggregate  (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.039..606.039 rows=1 loops=1)
         ->  Bitmap Heap Scan on item  (cost=1058.65..22701.38 rows=26102 width=0) (actual time=3.290..603.823 rows=4085 loops=1)
               Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
               ->  Bitmap Index Scan on item_counter_01  (cost=0.00..1052.13 rows=56515 width=0) (actual time=2.283..2.283 rows=4085 loops=1)
                     Index Cond: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) = true)
 Total runtime: 606.136 ms
(7 rows)

plan on explain.depesz.com

b2box=# set enable_seqscan=on;
SET
b2box=# explain analyze
Select count(*) 
from B2HEAD.item
where cluster = 'B2BOX' and (  ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) )  )  offset 0 limit 1;
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.163..10864.163 rows=1 loops=1)
   ->  Aggregate  (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.160..10864.160 rows=1 loops=1)
         ->  Seq Scan on item  (cost=0.00..22490.45 rows=26102 width=0) (actual time=33.574..10861.672 rows=4085 loops=1)
               Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
               Rows Removed by Filter: 108945
 Total runtime: 10864.242 ms
(6 rows)

plan on explain.depesz.com

2 Comments
  • Please post the explain plans. Commented Apr 18, 2013 at 11:14
  • @JakubKania - Please see edit above Commented Apr 18, 2013 at 13:04

1 Answer

Planner cost parameters

Cost of first plan (seqscan off) is slightly higher but processing time much faster

This tells me that your random_page_cost and seq_page_cost are probably wrong. You're likely on storage with fast random I/O - either because most of the database is cached in RAM or because you're using SSD, SAN with cache, or other storage where random I/O is inherently fast.
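
To see what the planner is currently working with, a quick check of the two parameters mentioned above:

SELECT name, setting, source
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost');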

Try:

SET random_page_cost = 1;
SET seq_page_cost = 1.1;

to greatly reduce the difference between the cost parameters, then re-run the query. If that does the job, consider changing those parameters in postgresql.conf.
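
If the improved plan holds up and a global edit is not convenient, the settings can also be persisted per database (a sketch; the database name b2box is taken from the psql prompt above and may need adjusting):

ALTER DATABASE b2box SET random_page_cost = 1;
ALTER DATABASE b2box SET seq_page_cost = 1.1;
-- Applies to new sessions connecting to that database; editing postgresql.conf
-- and reloading applies the change cluster-wide instead.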

Your row-count estimates are reasonable, so it doesn't look like a planner mis-estimation problem or a problem with bad table statistics.

Incorrect query

Your query is also incorrect. OFFSET 0 LIMIT 1 without an ORDER BY will produce unpredictable results unless you're guaranteed to have exactly one match, in which case the OFFSET ... LIMIT ... clauses are unnecessary and can be removed entirely.
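
For the query shown in the question, count(*) already returns exactly one row, so the fix is simply to drop the trailing clauses:

SELECT count(*)
FROM B2HEAD.item
WHERE cluster = 'B2BOX'
  AND xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content);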

You're usually much better off phrasing such queries as SELECT max(...) or SELECT min(...) where possible; PostgreSQL will tend to be able to use an index to just pluck off the desired value without doing an expensive table scan or an index scan and sort.
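
For instance, a hypothetical illustration of that pattern (the events table and ts column are made up for the example, not taken from the question):

-- Instead of:  SELECT ts FROM events ORDER BY ts DESC OFFSET 0 LIMIT 1;
-- prefer:
SELECT max(ts) FROM events;
-- With a btree index on events(ts), PostgreSQL can read a single entry from
-- the end of the index instead of scanning the table and sorting.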

Tips

BTW, for future questions the PostgreSQL wiki has some good information in the performance category and a guide to asking Slow query questions.

1 Comment

Great answer! Not only do your suggested parameter adjustments work fine, but you are also right in suspecting that we are using SSDs with plenty of RAM and that... we are starting with PostgreSQL support. The queries are programmatically generated; we will improve the generator to incorporate your comments on OFFSET and LIMIT. Finally, many thanks for the useful links.
