
I am using a Postgres table to store HTML data. One of the columns is of type varchar, and this is where the HTML data is stored. My understanding is that this column datatype has no maximum length.

I tried to create a unique index on this column to prevent duplicate HTML entries from being added.

This table is referenced by another table. Duplicate HTML entries are not allowed. This is a form of data compression using data (or table) normalization.

If Table A references Table B, and Table B stores de-duplicated HTML data, then the complete dataset requires less storage: rather than repeating the same HTML entries in Table A, each distinct entry is stored once in Table B and referenced from Table A. A rough sketch of this layout follows.
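
A minimal sketch of the layout I have in mind (the table and column names other than html_source_div are illustrative, not my actual schema):

-- Table B: each distinct HTML fragment is stored exactly once
CREATE TABLE rightmove.html_source_div (
    id              bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    html_source_div varchar NOT NULL  -- the HTML itself, arbitrary length
);

-- Table A: many rows can reference the same HTML fragment
CREATE TABLE rightmove.listing (
    id                 bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    html_source_div_id bigint NOT NULL REFERENCES rightmove.html_source_div (id)
);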

When I tried to create a unique index, I got an error:

CREATE UNIQUE INDEX html_source_div_html_source_div_idx 
ON rightmove.html_source_div (html_source_div)

SQL Error [54000]: ERROR: index row size 4536 exceeds btree version 4 maximum 2704 for index "html_source_div_html_source_div_idx"
Detail: Index row references tuple (0,46) in relation "html_source_div"
Hint: Values larger than 1/3 of a buffer page cannot be indexed. Consider a function index of an MD5 hash of the value, or use full text indexing.
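
For reference, the functional-index approach the hint suggests would look roughly like this (a sketch only, using the built-in md5() function):

CREATE UNIQUE INDEX html_source_div_md5_idx
    ON rightmove.html_source_div (md5(html_source_div));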

I want to add both an index for fast lookup of existing data and a unique constraint to prevent duplicate entries from being inserted. Postgres can probably serve both purposes with a single index, provided the right index type is used.

I do not want to MD5-hash the data, because there is a risk of a collision, and if a collision occurs the whole process will break.

  • Is there a type of index supported by Postgres which is designed to work on text-based data of arbitrary length?
  • Could I use such an index to enforce the unique constraint as well as improve text-search performance for select queries?

2 Answers


You can use a hash index. While they do not support unique constraints, they do support EXCLUDE constraints which fulfill much the same function.

create table jj (x text);
alter table jj add constraint lkj exclude using hash (x with =);

It will automatically resolve hash collisions (I think; I haven't tested this, as I don't know how to generate collisions at will).
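
A quick illustration of the constraint in action (the HTML snippet is just a placeholder):

insert into jj values ('<div>hello</div>');  -- succeeds
insert into jj values ('<div>hello</div>');
-- ERROR:  conflicting key value violates exclusion constraint "lkj"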


1 Comment

Today I learned: you can have exclusion constraints over hash indexes.

PostgreSQL indexes do indeed have a maximum size. You could mitigate the collision concern with a computed hash column; maybe try something like this:

-- digest() comes from the pgcrypto extension: CREATE EXTENSION IF NOT EXISTS pgcrypto;
ALTER TABLE html_source_div ADD COLUMN html_hash bytea;
UPDATE html_source_div SET html_hash = digest(html_source_div, 'sha256');

and

CREATE UNIQUE INDEX html_source_div_html_hash_idx ON html_source_div (html_hash);
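
With that in place, existence checks and de-duplicated inserts can go through the hash index; something along these lines (the literal HTML is a placeholder, and the hash can be computed client-side or by the server as shown):

-- Fast existence check via the unique hash index
SELECT 1
FROM html_source_div
WHERE html_hash = digest('<div>example</div>', 'sha256');

-- Insert only if no row with the same hash exists yet
INSERT INTO html_source_div (html_source_div, html_hash)
VALUES ('<div>example</div>', digest('<div>example</div>', 'sha256'))
ON CONFLICT (html_hash) DO NOTHING;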

3 Comments

I like this idea. I had been considering writing Python-side logic to maintain an MD5 or SHA-256 hash, which could be used as a cheap way to determine (A) whether a row with the same data may exist (a shortcut when there are no matching rows) and (B) the subset of potential matches that must be explicitly checked. This kind of lookup can be fast because the hash column can be indexed.
However, I have some further questions. What datatype should be used to store the hash? Apparently the most appropriate type for an MD5 hash is a UUID type, but that will not work for SHA-256, which has a different length.
Secondly, is there a way to get Postgres to calculate these hashes instead of having to do it client-side? I can see that your ALTER TABLE/UPDATE suggestion will backfill existing rows, but what about newly inserted rows? To be honest I'm not sure if this really makes much sense, since the client is going to have to calculate the hash anyway to check whether there are any hits for that hash.
