You can exploit the fact that xpath() returns an array.
The following expression:
xpath('/assignments/assignment/project_id/text()', assignments)::text[]
returns an array of strings with all project IDs. This expression can be indexed:
create index on candidates using gin ((xpath('/assignments/assignment/project_id/text()', assignments)::text[]));
And that index can be used by the following query:
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885'];
@> is the "contains" operator for arrays, and it is supported by GIN indexes.
You can use that to check for multiple IDs with a single condition:
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885', '6512779208625374886'];
The above would return rows that contain both project_ids in the XML.
If you use the "overlaps" operator &&, you can also search for rows that contain any of the elements:
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] && array['6512779208625374885', '6512779208625374886'];
The above returns rows that contain at least one of those project_ids in the XML.
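To make the difference between the two operators concrete, here is a minimal sketch that needs no table or index. The inline document and the IDs 1, 2 and 9 are made up for illustration; only the XPath expression is taken from above.

```sql
-- Inline sample document; the project IDs 1 and 2 are invented for this example.
with doc(assignments) as (
  values ('<assignments>
             <assignment><project_id>1</project_id></assignment>
             <assignment><project_id>2</project_id></assignment>
           </assignments>'::xml)
)
select ids,                                          -- {1,2}
       ids @> array['1', '2'] as contains_1_and_2,   -- true: both IDs are present
       ids @> array['1', '9'] as contains_1_and_9,   -- false: 9 is missing
       ids && array['1', '9'] as overlaps_1_or_9     -- true: at least one ID is present
from (
  select xpath('/assignments/assignment/project_id/text()', assignments)::text[] as ids
  from doc
) as t;
```

Note that the array literals contain strings, not numbers, because the indexed expression is cast to text[].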
For more details on array operators, see the manual.
The drawback is that GIN indexes are bigger and more expensive to maintain than BTree indexes.
I verified this with the following test setup:
create table candidates
(
id integer,
assignments xml
);
insert into candidates
select i, format('<assignments>
<assignment>
<project_id>%s</project_id>
<start_date>2018-02-05T14:30:06+00:00</start_date>
<state_id>1</state_id>
</assignment>
<assignment>
<project_id>%s</project_id>
<start_date>2017-12-01T15:30:00+00:00</start_date>
<state_id>0</state_id>
</assignment>
<assignment>
<project_id>%s</project_id>
<start_date>2017-12-15T10:00:00+00:00</start_date>
<state_id>1</state_id>
</assignment></assignments>', i, 10000000 + i, 20000000 + i)::xml
from generate_series(1,1000000) as i;
So the table candidates now contains a million rows with 3 different project_ids each.
explain (analyze, verbose, buffers)
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['10000047'];
shows the following plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test.candidates  (cost=29.25..6604.48 rows=5000 width=473) (actual time=0.032..0.032 rows=1 loops=1)
  Output: id, assignments
  Recheck Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])
  Heap Blocks: exact=1
  Buffers: shared hit=5
  ->  Bitmap Index Scan on candidates_xpath_idx  (cost=0.00..28.00 rows=5000 width=0) (actual time=0.028..0.028 rows=1 loops=1)
        Index Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])
        Buffers: shared hit=4
Planning time: 0.162 ms
Execution time: 0.078 ms
Less than a tenth of a millisecond to search through a million XML documents doesn't seem too bad.
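To get a feeling for the size drawback mentioned earlier, you could compare the index with the table it belongs to. This is only a sketch: it assumes the index has the auto-generated name candidates_xpath_idx shown in the plan and that the schema holding the table is on your search_path. I did not record the resulting numbers, so none are shown here.

```sql
-- pg_relation_size() reports the on-disk size of the index,
-- pg_table_size() the size of the table including its TOAST data.
select pg_size_pretty(pg_relation_size('candidates_xpath_idx')) as gin_index_size,
       pg_size_pretty(pg_table_size('candidates'))              as table_size;
```

Running the same comparison against a BTree index on a plain column would show how much extra space the GIN index costs for your data.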