PostgreSQL: Index to optimize xmlexists on xml with arrays

Question

I am porting my application from MS SQL to PostgreSQL 10.1 and got stuck dealing with XML fields. I changed "exist()" to "xmlexists()" in my queries, so a typical query now looks like:

SELECT t."id", t."fullname"
FROM "candidates" t
WHERE xmlexists('//assignments/assignment/project_id[.=''6512779208625374885'']'
           PASSING BY REF t.assignments );

assuming that the "assignments" column contains XML data with the following structure:

<assignments>
<assignment>
    <project_id>6512779208625374885</project_id>
    <start_date>2018-02-05T14:30:06+00:00</start_date>
    <state_id>1</state_id>
</assignment>
<assignment>
    <project_id>7512979208625374996</project_id>
    <start_date>2017-12-01T15:30:00+00:00</start_date>
    <state_id>0</state_id>
</assignment>
<assignment>
    <project_id>5522979707625370402</project_id>
    <start_date>2017-12-15T10:00:00+00:00</start_date>
    <state_id>1</state_id>
</assignment>
</assignments>

The question is how to build an efficient index for this type of query. As I understand it, there is no generic XPath index like those in MS SQL, so I need to build a specific one. But all the examples I managed to find (e.g. "Postgresql 9.x: Index to optimize `xpath_exists` (XMLEXISTS) queries") were about nested fields, not arrays.

P.S. I tried switching from XML to JSONB, but this would require rewriting a lot of queries using jsonb_array_elements() with joins, which I want to avoid.

score 3 · Accepted Answer · 2018-02-05 14:33:03Z

You can exploit the fact that xpath() returns an array.

The following expression:

xpath('/assignments/assignment/project_id/text()', assignments)::text[]

returns an array of strings with all project IDs. This expression can be indexed:

create index on candidates using gin ((xpath('/assignments/assignment/project_id/text()', assignments)::text[]));
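To see what the indexed expression evaluates to, it can be run against an inline document. This is only a minimal sketch: the XML literal and the IDs 101 and 202 are made up for illustration and are not taken from the real table.

```sql
-- Evaluate the indexed expression against a small, made-up document.
-- xpath() returns xml[]; the cast turns it into text[] so array operators apply.
select xpath('/assignments/assignment/project_id/text()',
             '<assignments>
                <assignment><project_id>101</project_id></assignment>
                <assignment><project_id>202</project_id></assignment>
              </assignments>'::xml)::text[] as project_ids;
-- project_ids: {101,202}
```

Each element of the resulting array becomes a separate key in the GIN index, which is what allows a lookup by any single project ID.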

And that index can be used by the following query:

select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885'];

The @> operator is the "contains" operator for arrays, and it is supported by GIN indexes.

You can use that to check for multiple IDs with a single condition:

select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885', '6512779208625374886'];

The above would return rows that contain both project_ids in the XML.

If you use the "overlaps" operator &&, you can also search for rows that contain any of the elements:

select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] && array['6512779208625374885', '6512779208625374886'];

The above returns rows that contain at least one of those project_ids in the XML.

For more details on array operators, see the manual.

The drawback is that GIN indexes are larger and more expensive to maintain than B-tree indexes.
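To get a feel for the size cost on your own data, the built-in size functions can be compared for the table and the index. This is a sketch that assumes the index received PostgreSQL's default generated name, candidates_xpath_idx; substitute your own name if you chose one explicitly.

```sql
-- Compare the table size (including TOAST) with the size of the GIN index.
-- Assumption: the index is named candidates_xpath_idx (the default generated name).
select pg_size_pretty(pg_table_size('candidates'))              as table_size,
       pg_size_pretty(pg_relation_size('candidates_xpath_idx')) as gin_index_size;
```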

I verified this with the following test setup:

create table candidates
(
  id integer,
  assignments  xml
);

insert into candidates
select i, format('<assignments>
                    <assignment>
                        <project_id>%s</project_id>
                        <start_date>2018-02-05T14:30:06+00:00</start_date>
                        <state_id>1</state_id>
                    </assignment>
                    <assignment>
                        <project_id>%s</project_id>
                        <start_date>2017-12-01T15:30:00+00:00</start_date>
                        <state_id>0</state_id>
                    </assignment>
                    <assignment>
                        <project_id>%s</project_id>
                        <start_date>2017-12-15T10:00:00+00:00</start_date>
                        <state_id>1</state_id>
                    </assignment></assignments>', i, 10000000 + i, 20000000 + i)::xml
from generate_series(1,1000000) as i;

So the table candidates now contains a million rows with 3 different project_ids each.

explain (analyze, buffers)
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['10000047'];

shows the following plan:

QUERY PLAN                                                                                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test.candidates  (cost=29.25..6604.48 rows=5000 width=473) (actual time=0.032..0.032 rows=1 loops=1)                             
  Output: id, assignments                                                                                                                             
  Recheck Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])    
  Heap Blocks: exact=1                                                                                                                                
  Buffers: shared hit=5                                                                                                                               
  ->  Bitmap Index Scan on candidates_xpath_idx  (cost=0.00..28.00 rows=5000 width=0) (actual time=0.028..0.028 rows=1 loops=1)                       
        Index Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])
        Buffers: shared hit=4                                                                                                                         
Planning time: 0.162 ms                                                                                                                               
Execution time: 0.078 ms

Less than a tenth of a millisecond to search through a million XML documents doesn't seem too bad.
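One optional refinement, not part of the tested setup above: an expression index is only considered when the query contains the same expression as the index definition, so wrapping the expression in a function can keep the two from drifting apart. A minimal sketch, assuming the hypothetical names candidate_project_ids and candidates_project_ids_idx:

```sql
-- Hypothetical helper (name not from the original answer).
-- It must be declared IMMUTABLE to be usable in an index expression.
create function candidate_project_ids(doc xml)
  returns text[]
  language sql
  immutable
as $$
  select xpath('/assignments/assignment/project_id/text()', doc)::text[];
$$;

-- Index the function result instead of repeating the xpath() call.
create index candidates_project_ids_idx
  on candidates using gin (candidate_project_ids(assignments));

-- Queries then use the same function call with @> or &&.
select id
from candidates
where candidate_project_ids(assignments) && array['10000042', '20000042'];
```

If the XPath ever changes, the function and the index have to be recreated together, since the index stores values computed with the old definition.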

Thanks, it works! Is it possible to use the same approach to query for multiple values using "match some" logic? I have found that '//assignments/assignment/project_id[.=(''6512779208625374885'',''6512779208625374886'')]' does not work in PostgreSQL. And there is no "intersects" operator, only the "contains" operator.
@BorisL: the "contains" operator (@>) also works for multiple values and checks for all of them. The "overlaps" operator (&&) checks for any element from the array.
Is it possible to use the same approach for jsonb fields? The condition where assignments @> '[{"project_id":"6520593905550770710"}]' works fine, but there is no && operator in this case. And I failed to find something like a jsonb_array_keys() function to extract an array(varchar) from jsonb.
