I am porting my application from MS SQL to PostgreSQL 10.1 and I got stuck on XML fields. I changed "exist()" to "xmlexists()" in my queries, so a typical query now looks like this:

SELECT t."id", t."fullname"
FROM "candidates" t
WHERE xmlexists('//assignments/assignment/project_id[.=''6512779208625374885'']'
           PASSING BY REF t.assignments );

assuming that the "assignments" column contains XML data with the following structure:

<assignments>
<assignment>
    <project_id>6512779208625374885</project_id>
    <start_date>2018-02-05T14:30:06+00:00</start_date>
    <state_id>1</state_id>
</assignment>
<assignment>
    <project_id>7512979208625374996</project_id>
    <start_date>2017-12-01T15:30:00+00:00</start_date>
    <state_id>0</state_id>
</assignment>
<assignment>
    <project_id>5522979707625370402</project_id>
    <start_date>2017-12-15T10:00:00+00:00</start_date>
    <state_id>1</state_id>
</assignment>
</assignments>

The question is how to build an efficient index for this type of query. As I understand it, there is no generic XPath index like the ones in MS SQL, so I need to build a specific one. But all the examples I managed to find (e.g. Postgresql 9.x: Index to optimize `xpath_exists` (XMLEXISTS) queries) were about nested fields, not arrays.

P.S. I tried switching from XML to JSONB, but this would require rewriting a lot of queries using jsonb_array_elements() with joins, which I want to avoid.

1 Answer

You can exploit the fact that xpath() returns an array.

The following expression:

xpath('/assignments/assignment/project_id/text()', assignments)::text[]

returns an array of strings with all project IDs. This expression can be indexed:

create index on candidates using gin ((xpath('/assignments/assignment/project_id/text()', assignments)::text[]));

And that index can be used by the following query:

select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885'];

The @> operator is the "contains" operator for arrays, and it is supported by GIN indexes.
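
The planner only considers an expression index when the query repeats the indexed expression, so the xpath() call has to be written the same way in every query. One way to avoid copying it around is to wrap it in a function and index that instead. This is only a sketch: the function name candidate_project_ids and the index name candidates_project_ids_idx are made up for illustration, and it assumes the table and column names used in this answer.

```sql
-- Hypothetical helper: wraps the xpath() expression so that queries
-- and the index share a single definition.
-- It must be declared IMMUTABLE to be usable in an index expression.
create function candidate_project_ids(p_assignments xml)
  returns text[]
  language sql
  immutable
as $$
  select xpath('/assignments/assignment/project_id/text()', p_assignments)::text[]
$$;

-- GIN index on the function result instead of the raw expression
create index candidates_project_ids_idx
  on candidates using gin (candidate_project_ids(assignments));

-- Queries then call the same function
select *
from candidates
where candidate_project_ids(assignments) @> array['6512779208625374885'];
```

If the XPath ever changes, the function and the index have to be recreated together, because the index stores the values computed by the old definition.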

You can also use @> to check for multiple IDs with a single condition:

select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885', '6512779208625374886'];

The above would return rows that contain both project_ids in the XML.

If you use the "overlaps" operator &&, you can also search for rows that contain any of the elements:

select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] && array['6512779208625374885', '6512779208625374886'];

The above returns rows that contain at least one of those project_ids in the XML.
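
The difference between the two operators is easy to check on literal arrays, without any table. The values and column aliases below are made up for illustration:

```sql
select array['a','b','c'] @> array['a','b'] as contains_both,    -- true: 'a' and 'b' are both present
       array['a','b','c'] @> array['a','x'] as contains_a_and_x, -- false: 'x' is missing
       array['a','b','c'] && array['a','x'] as overlaps_a_or_x,  -- true: 'a' is present
       array['a','b','c'] && array['x','y'] as overlaps_x_or_y;  -- false: nothing in common
```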

For more details on array operators, see the manual

The drawback is that GIN indexes are bigger and more expensive to maintain than BTree indexes.
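
To see how much space the index takes compared to the table itself, something like the following should work. It assumes the index got the auto-generated name candidates_xpath_idx (the name that shows up in the plan further down); adjust it if you named the index yourself.

```sql
-- Compare table size (including TOAST) with the size of the GIN index.
-- The index name is an assumption, see above.
select pg_size_pretty(pg_table_size('candidates'))              as table_size,
       pg_size_pretty(pg_relation_size('candidates_xpath_idx')) as gin_index_size;
```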


I verified this with the following test setup:

create table candidates
(
  id integer,
  assignments  xml
);

insert into candidates
select i, format('<assignments>
                    <assignment>
                        <project_id>%s</project_id>
                        <start_date>2018-02-05T14:30:06+00:00</start_date>
                        <state_id>1</state_id>
                    </assignment>
                    <assignment>
                        <project_id>%s</project_id>
                        <start_date>2017-12-01T15:30:00+00:00</start_date>
                        <state_id>0</state_id>
                    </assignment>
                    <assignment>
                        <project_id>%s</project_id>
                        <start_date>2017-12-15T10:00:00+00:00</start_date>
                        <state_id>1</state_id>
                    </assignment></assignments>', i, 10000000 + i, 20000000 + i)::xml
from generate_series(1,1000000) as i;

So the table candidates now contains a million rows with 3 different project_ids each.
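
As a quick sanity check of the generated data, you can look at what the indexed expression returns for a single row. The id 42 is an arbitrary pick:

```sql
-- Show the array of project IDs extracted from one document
select id,
       xpath('/assignments/assignment/project_id/text()', assignments)::text[] as project_ids
from candidates
where id = 42;
```

Given the insert above, this should show {42,10000042,20000042} for that row. Note that there is no index on id in this test table, so the check itself does a sequential scan.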

explain (analyze, buffers)
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['10000047'];

shows the following plan:

QUERY PLAN                                                                                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test.candidates  (cost=29.25..6604.48 rows=5000 width=473) (actual time=0.032..0.032 rows=1 loops=1)                             
  Output: id, assignments                                                                                                                             
  Recheck Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])    
  Heap Blocks: exact=1                                                                                                                                
  Buffers: shared hit=5                                                                                                                               
  ->  Bitmap Index Scan on candidates_xpath_idx  (cost=0.00..28.00 rows=5000 width=0) (actual time=0.028..0.028 rows=1 loops=1)                       
        Index Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])
        Buffers: shared hit=4                                                                                                                         
Planning time: 0.162 ms                                                                                                                               
Execution time: 0.078 ms                                                                                                                              

Less than a tenth of a millisecond to search through a million XML documents doesn't seem too bad.

3 Comments

Thanks, it works! Is it possible to use the same approach to query for multiple values using "match some" logic? I have found that '//assignments/assignment/project_id[.=(''6512779208625374885'',''6512779208625374886'')]' does not work in PostgreSQL. And there is no "intersects" operator, only the "contains" operator.
@BorisL: the "contains" operator (@>) also works for multiple values and checks for all of them. The "overlaps" operator (&&) checks for any element from the array.
Is it possible to use the same approach for jsonb fields? where assignments @> '[{"project_id":"6520593905550770710"}]' works fine, but there is no && operator in this case. And I failed to find something like jsonb_array_keys() function to extract array(varchar) from jsonb.
