
I'm using PostgreSQL 16, and I have a number of tables whose display ID columns need to be treated as case insensitive and also need to support LIKE queries with wildcards (e.g. P%123%). To handle these queries efficiently with range scans, I set the collation for those columns to C rather than the default.

With the new requirement for case insensitive searching, I'm considering changing the column's datatype to citext (https://www.postgresql.org/docs/current/citext.html). Will leaving the collation on these columns as C cause issues, since C is a case sensitive collation? What is the recommended collation for a citext column?

1 Answer


That citext doc already tells you it's somewhat superseded by case-insensitive collations:

Consider using nondeterministic collations (see Section 23.2.2.4) instead of this module. They can be used for case-insensitive comparisons, accent-insensitive comparisons, and other combinations, and they handle more Unicode special cases correctly.
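For reference, a case-insensitive nondeterministic collation of the kind that note describes can be defined roughly as below. This is only a sketch: it assumes an ICU-enabled build, and the names case_insensitive and display_ids are made up for illustration. Keep in mind that before version 18 such a collation does not support LIKE, so on 16 it covers equality and ordering only.

```sql
-- Assumes PostgreSQL was built with ICU support.
-- "case_insensitive" is an illustrative name, not something from the question.
create collation case_insensitive (
  provider = icu,
  locale = 'und-u-ks-level2',  -- secondary strength: ignores case, keeps accents distinct
  deterministic = false
);

-- Hypothetical table with a plain text column using that collation.
create table display_ids (
  id text collate case_insensitive primary key
);

insert into display_ids values ('P-00123');

-- Equality is case-insensitive under this collation.
select count(*) from display_ids where id = 'p-00123';
```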

You're better off with a regular text type and a custom collation, or "C" with an expression index using lower(). You can find a few benchmarks here:

If you upgrade to version 18 (release candidate 1 is out), you get nondeterministic collation support for LIKE which handles your prefix search.
In PostgreSQL 16, use collate "C" with a text_pattern_ops expression index:

create unique index on test_lower_collate_c 
  (lower(a) collate "C" text_pattern_ops);

explain analyse verbose
select count(*) from test_lower_collate_c where lower(a) like 'eb%5%';
QUERY PLAN
Aggregate (cost=146.29..146.30 rows=1 width=8) (actual time=0.283..0.284 rows=1 loops=1)
Output: count(*)
-> Bitmap Heap Scan on public.test_lower_collate_c (cost=4.94..146.28 rows=5 width=0) (actual time=0.072..0.277 rows=26 loops=1)
Filter: (lower(test_lower_collate_c.a) ~~ 'eb%5%'::text)
Rows Removed by Filter: 162
Heap Blocks: exact=133
-> Bitmap Index Scan on test_lower_collate_c_lower_idx (cost=0.00..4.94 rows=65 width=0) (actual time=0.043..0.043 rows=188 loops=1)
Index Cond: ((lower(test_lower_collate_c.a) >= 'eb'::text) AND (lower(test_lower_collate_c.a) < 'ec'::text))
Planning Time: 0.418 ms
Execution Time: 0.359 ms

Will leaving the collation on these columns as C cause issues, since C is a case sensitive collation?

citext keeps the value as you entered it, but folds both sides to lowercase whenever it compares them, so case differences never affect comparisons. In that respect, the case sensitivity of "C" is not a problem.
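How citext treats case can be checked directly. The snippet below is a minimal sketch that assumes you are allowed to install the citext extension in the database; the column aliases are only there for readability.

```sql
create extension if not exists citext;

-- Compare two values that differ only in case.
select 'AbC'::citext = 'abc'::citext as equal_ignoring_case;

-- Cast back to text to see the value citext actually holds.
select 'AbC'::citext::text as stored_value;
```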

It might be a problem if you're dealing with accents and other non-ASCII text, because collate "C" compares by byte value and places those characters in a different range. Under it, where a ~>=~ 'ea' and a ~<~ 'eb' won't find values starting with 'eá', because accented variants land after the whole ASCII alphabet instead of next to their base letter.
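The ordering is easy to confirm with a literal comparison. This sketch assumes a UTF8-encoded database; the column aliases are illustrative. The first column should come back true and the second false, which is why the 'ea' to 'eb' range skips 'eá'.

```sql
-- Under "C", strings are compared byte by byte, so 'á' sorts after every ASCII letter.
select 'eá'::text collate "C" >= 'ea' as at_or_after_ea,
       'eá'::text collate "C" <  'eb' as before_eb;
```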

Another thing of note is that, with citext, I don't see the optimiser adding the range condition to a pattern-based search on its own. Given a query like this on the text column:

select from test_lower_collate_c where lower(a) like 'eb%5%';

text_pattern_ops gets you an additional index condition derived from the pattern's prefix, which speeds up the search:

Index Cond: ((lower(test_lower_collate_c.a) ~>=~ 'eb'::text) AND (lower(test_lower_collate_c.a) ~<~ 'ec'::text))

Meanwhile, with citext_pattern_ops I needed to add the range conditions myself:

select from test_citext_collate_c where a like 'eb%5%' and a ~>=~ 'eb' and a ~<~ 'ec';

And the timing was still worse than for the expression-based index.
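For anyone reproducing the comparison, the citext side could be set up along these lines. The table definition is a guess, since the demo's actual definition isn't shown here; only the table name and the citext_pattern_ops operator class come from the text above.

```sql
create extension if not exists citext;

-- Hypothetical definition; the original demo table is not shown in this answer.
create table test_citext_collate_c (a citext collate "C");

-- Index using the pattern operator class that ships with the citext extension.
create index on test_citext_collate_c (a citext_pattern_ops);

-- Check whether the explicit range conditions show up as an Index Cond.
explain
select from test_citext_collate_c
where a like 'eb%5%' and a ~>=~ 'eb' and a ~<~ 'ec';
```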


What is the recommended collation for a citext column?

If your values and patterns are plain ASCII, COLLATE "C" can handle them fast. Otherwise, prefix ranges like the one above will miss accented and other non-ASCII variants.


1 Comment

I am on postgres 16, not 18 so I do need a solution that works on 16, and I am doing prefix searches. We can't just upgrade our production DB.
