
I am using a Postgres table to store HTML data. One of the columns is of type varchar, and this is where the HTML data is stored. My understanding is that this column datatype has no maximum length.

I tried to create a unique index on this column to prevent duplicate HTML entries from being added.

This table is referenced by another table. Duplicate HTML entries are not allowed. This is a form of data compression using data (or table) normalization.

If Table A references Table B, and Table B stores de-duplicated HTML data, then the complete dataset requires less storage: rather than repeating the same HTML entries in Table A, each distinct entry is stored once in Table B and referenced from Table A. A rough sketch of this layout follows.
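
A minimal sketch of the layout I have in mind (the table and column names other than html_source_div are illustrative, not my actual schema):

-- Table B: each distinct HTML fragment is stored exactly once
CREATE TABLE rightmove.html_source_div (
    id              bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    html_source_div varchar NOT NULL  -- the HTML itself, arbitrary length
);

-- Table A: many rows can reference the same HTML fragment
CREATE TABLE rightmove.listing (
    id                 bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    html_source_div_id bigint NOT NULL REFERENCES rightmove.html_source_div (id)
);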

When I tried to create a unique index, I got an error:

CREATE UNIQUE INDEX html_source_div_html_source_div_idx 
ON rightmove.html_source_div (html_source_div)

SQL Error [54000]: ERROR: index row size 4536 exceeds btree version 4 maximum 2704 for index "html_source_div_html_source_div_idx"
Detail: Index row references tuple (0,46) in relation "html_source_div"
Hint: Values larger than 1/3 of a buffer page cannot be indexed. Consider a function index of an MD5 hash of the value, or use full text indexing.
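
For reference, the functional-index approach the hint suggests would look roughly like this (a sketch only, using the built-in md5() function):

CREATE UNIQUE INDEX html_source_div_md5_idx
    ON rightmove.html_source_div (md5(html_source_div));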

I want to add both an index for fast lookup of existing data and a unique constraint to prevent duplicate entries from being inserted. Postgres can probably serve both purposes with a single index, provided the right index type is used.

I do not want to MD5-hash the data, because there is a risk of a collision, and if a collision occurs the whole process will break.

  • Is there a type of index supported by Postgres which is designed to work on text-based data of arbitrary length?
  • Could I use such an index to enforce the unique constraint as well as improve text-search performance for select queries?

2 Answers


You can use a hash index. While they do not support unique constraints, they do support EXCLUDE constraints which fulfill much the same function.

create table jj (x text);
alter table jj add constraint lkj exclude using hash (x with =);

It will automatically resolve hash collisions (I think; I haven't tested this, as I don't know how to generate collisions at will).
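
A quick illustration of the constraint in action (the HTML snippet is just a placeholder):

insert into jj values ('<div>hello</div>');  -- succeeds
insert into jj values ('<div>hello</div>');
-- ERROR:  conflicting key value violates exclusion constraint "lkj"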


1 Comment

Today I learned: you can have exclusion constraints over hash indexes.

PostgreSQL indexes do indeed have a maximum size. You could mitigate the collision concern with a computed hash column; maybe try something like this:

-- digest() comes from the pgcrypto extension: CREATE EXTENSION IF NOT EXISTS pgcrypto;
ALTER TABLE html_source_div ADD COLUMN html_hash bytea;
UPDATE html_source_div SET html_hash = digest(html_source_div, 'sha256');

and

CREATE UNIQUE INDEX html_source_div_html_hash_idx ON html_source_div (html_hash);
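
With that in place, existence checks and de-duplicated inserts can go through the hash index; something along these lines (the literal HTML is a placeholder, and the hash can be computed client-side or by the server as shown):

-- Fast existence check via the unique hash index
SELECT 1
FROM html_source_div
WHERE html_hash = digest('<div>example</div>', 'sha256');

-- Insert only if no row with the same hash exists yet
INSERT INTO html_source_div (html_source_div, html_hash)
VALUES ('<div>example</div>', digest('<div>example</div>', 'sha256'))
ON CONFLICT (html_hash) DO NOTHING;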

3 Comments

I like this idea. I had been considering writing Python-side logic to maintain an MD5 or SHA-256 hash, which could be used as a cheap way to determine (A) whether a row with the same data may exist (a shortcut when there are no matching rows) and (B) the subset of potential matches that must be explicitly checked. This kind of lookup can be fast because the hash column can be indexed.
However, I have some further questions. What datatype should be used to store the hash? Apparently the most appropriate type for an MD5 hash is a UUID type, but that will not work for SHA-256, which has a different length.
Secondly, is there a way to get Postgres to calculate these hashes instead of having to do it client-side? I can see that your ALTER TABLE/UPDATE suggestion will backfill existing rows, but what about newly inserted rows? To be honest I'm not sure if this really makes much sense, since the client is going to have to calculate the hash anyway to check whether there are any hits for that hash.
