I have two tables, each with around 2.3 million rows, running on PostgreSQL 17.4:
CREATE TABLE keyeddata (
category text NULL,
key1 text NULL,
key2 text NULL,
key3 text NULL,
key4 text NULL,
"parameter" text NULL,
value text NULL,
meta1 text NULL,
meta2 text NULL,
"version" int4 NULL,
updatedatetime timestamp NULL
);
CREATE UNIQUE INDEX keyeddataidx ON keyeddata USING btree (category, key1, key2, key3, key4, parameter) NULLS NOT DISTINCT;
CREATE TABLE keyeddataaudit (
category text NULL,
key1 text NULL,
key2 text NULL,
key3 text NULL,
key4 text NULL,
"parameter" text NULL,
value text NULL,
meta1 text NULL,
meta2 text NULL,
"version" int4 NULL,
updatedatetime timestamp NULL
);
CREATE UNIQUE INDEX keyeddataauditidx ON keyeddataaudit USING btree (category, key1, key2, key3, key4, parameter, version) NULLS NOT DISTINCT;
I want to delete rows from the audit table when both of the following requirements are satisfied:
- updatedatetime is older than a certain cutoff date
- there is a row with identical keys in keyeddata (or keyeddataaudit, I don't mind which) with a higher version
The idea is to delete old values, but only if there is a more recent one.
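For reference, the two requirements could be written as a delete roughly like the one below. This is only a sketch: it uses plain equality on the key columns, the cutoff date is the one used in the queries in this post, and it checks for the newer version in keyeddata rather than in keyeddataaudit.

```sql
-- Sketch: delete audit rows older than the cutoff, but only when keyeddata
-- holds a row with the same keys and a higher version.
-- Plain "=" is used here, so audit rows with a NULL key column are never matched.
DELETE FROM keyeddataaudit a
WHERE a.updatedatetime < '2025-10-01'
  AND EXISTS (
        SELECT 1
        FROM keyeddata t
        WHERE t.category  = a.category
          AND t.key1      = a.key1
          AND t.key2      = a.key2
          AND t.key3      = a.key3
          AND t.key4      = a.key4
          AND t.parameter = a.parameter
          AND t.version   > a.version
      );
```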
I can get the same performance issues with either a select or a delete, so these examples are using a select.
If I run this query:
select count(*) from keyeddataaudit a, keyeddata t
WHERE a.updatedatetime < '2025-10-01' AND
(a.category = t.category) AND
(a.key1 = t.key1 ) AND
(a.key2 = t.key2 ) AND
(a.key3 = t.key3 ) AND
(a.key4 = t.key4 ) AND
(a.parameter = t.parameter ) AND
a.version < t.version;
This uses the index and completes in under a second.
However, it doesn't handle NULLs: a row with a NULL in any key column never matches, because = is never true when either side is NULL.
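A minimal illustration of the NULL behaviour in question, independent of the tables above: comparing two NULLs with = yields NULL (which a WHERE clause treats as no match), while IS NOT DISTINCT FROM treats them as equal.

```sql
-- "=" returns NULL when either operand is NULL, so the row is filtered out;
-- IS NOT DISTINCT FROM returns true when both operands are NULL.
SELECT NULL::text = NULL::text                    AS equals_result,        -- NULL
       NULL::text IS NOT DISTINCT FROM NULL::text AS not_distinct_result;  -- true
```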
If I change it to:
select count(*) from keyeddataaudit a, keyeddata t
WHERE a.updatedatetime < '2025-10-01' AND
(a.category is not distinct from t.category) AND
(a.key1 is not distinct from t.key1 ) AND
(a.key2 is not distinct from t.key2 ) AND
(a.key3 is not distinct from t.key3 ) AND
(a.key4 is not distinct from t.key4 ) AND
(a.parameter is not distinct from t.parameter ) AND
a.version < t.version;
Or if I try:
select count(*) from keyeddataaudit a, keyeddata t
WHERE a.updatedatetime < '2025-10-01' AND
(a.category = t.category OR (a.category is null and t.category is null)) AND
(a.key1 = t.key1 OR (a.key1 is null and t.key1 is null) ) AND
(a.key2 = t.key2 OR (a.key2 is null and t.key2 is null) ) AND
(a.key3 = t.key3 OR (a.key3 is null and t.key3 is null) ) AND
(a.key4 = t.key4 OR (a.key4 is null and t.key4 is null) ) AND
(a.parameter = t.parameter OR (a.parameter is null and t.parameter is null) ) AND
a.version < t.version;
In both cases the query ran for over 5 minutes without completing before I cancelled it.
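One way to compare the plans without waiting for the slow queries to finish is plain EXPLAIN, which plans the statement but does not execute it (unlike EXPLAIN ANALYZE). A sketch using the IS NOT DISTINCT FROM variant; no output is shown because the plan depends on the actual data and statistics.

```sql
-- Shows the chosen plan (join method, index usage) without running the query.
EXPLAIN
SELECT count(*)
FROM keyeddataaudit a, keyeddata t
WHERE a.updatedatetime < '2025-10-01'
  AND a.category  IS NOT DISTINCT FROM t.category
  AND a.key1      IS NOT DISTINCT FROM t.key1
  AND a.key2      IS NOT DISTINCT FROM t.key2
  AND a.key3      IS NOT DISTINCT FROM t.key3
  AND a.key4      IS NOT DISTINCT FROM t.key4
  AND a.parameter IS NOT DISTINCT FROM t.parameter
  AND a.version < t.version;
```

Running the same EXPLAIN on the plain = version makes it possible to see where the two plans diverge.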
How do I change either the indexes or the queries so it actually uses them, please? Or alternatively is there some other way to achieve this with reasonable performance?
Edit: This completes in 48 seconds:
select count(*) from spreads.keyeddataaudit a, spreads.keyeddata t
WHERE a.updatedatetime < '2025-10-01' AND
(coalesce(a.category, 'null') = coalesce(t.category, 'null')) AND
(coalesce(a.key1, 'null') = coalesce(t.key1, 'null')) AND
(coalesce(a.key2, 'null') = coalesce(t.key2, 'null')) AND
(coalesce(a.key3, 'null') = coalesce(t.key3, 'null')) AND
(coalesce(a.key4, 'null') = coalesce(t.key4, 'null')) AND
(coalesce(a.parameter, 'null') = coalesce(t.parameter, 'null')) AND
a.version < t.version;
But this would produce false matches if the data ever contained the literal string 'null'.
Checking each column against a second sentinel value as well avoids that, and completes in 1m29s:
select * from spreads.keyeddataaudit a, spreads.keyeddata t
WHERE a.updatedatetime < '2025-10-01' AND
(coalesce(a.category, 'null') = coalesce(t.category, 'null') and coalesce(a.category, 'null2') = coalesce(t.category, 'null2')) AND
(coalesce(a.key1, 'null') = coalesce(t.key1, 'null') and coalesce(a.key1, 'null2') = coalesce(t.key1, 'null2')) AND
(coalesce(a.key2, 'null') = coalesce(t.key2, 'null') and coalesce(a.key2, 'null2') = coalesce(t.key2, 'null2')) AND
(coalesce(a.key3, 'null') = coalesce(t.key3, 'null') and coalesce(a.key3, 'null2') = coalesce(t.key3, 'null2')) AND
(coalesce(a.key4, 'null') = coalesce(t.key4, 'null') and coalesce(a.key4, 'null2') = coalesce(t.key4, 'null2')) AND
(coalesce(a.parameter, 'null') = coalesce(t.parameter, 'null') and coalesce(a.parameter, 'null2') = coalesce(t.parameter, 'null2')) AND
a.version < t.version;
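To make the difference between the two coalesce variants concrete, here is a small self-contained check on literal values rather than the real tables. The column names a and t and the sample values are made up for illustration.

```sql
-- Compare the single-sentinel and two-sentinel predicates on three sample pairs.
SELECT a, t,
       coalesce(a, 'null') = coalesce(t, 'null')          AS one_sentinel,
       coalesce(a, 'null') = coalesce(t, 'null')
         AND coalesce(a, 'null2') = coalesce(t, 'null2')  AS two_sentinels
FROM (VALUES (NULL::text, 'null'::text),  -- NULL vs the string 'null': true / false
             (NULL,       NULL),          -- both NULL:                 true / true
             ('x',        'x')            -- ordinary equal values:     true / true
     ) AS v(a, t);
```

With one sentinel, a NULL is indistinguishable from the string 'null'. With two different sentinels, a real value cannot equal both of them at once, so a NULL only matches another NULL.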
But it really feels like there should be a better way!