
I have two tables. They each have around 2.3 million rows in them.

They are running on PostgreSQL 17.4.

CREATE TABLE keyeddata (
    category text NULL,
    key1 text NULL,
    key2 text NULL,
    key3 text NULL,
    key4 text NULL,
    "parameter" text NULL,
    value text NULL,
    meta1 text NULL,
    meta2 text NULL,
    "version" int4 NULL,
    updatedatetime timestamp NULL
);

CREATE UNIQUE INDEX keyeddataidx ON keyeddata USING btree (category, key1, key2, key3, key4, parameter) NULLS NOT DISTINCT;

CREATE TABLE keyeddataaudit (
    category text NULL,
    key1 text NULL,
    key2 text NULL,
    key3 text NULL,
    key4 text NULL,
    "parameter" text NULL,
    value text NULL,
    meta1 text NULL,
    meta2 text NULL,
    "version" int4 NULL,
    updatedatetime timestamp NULL
);
CREATE UNIQUE INDEX keyeddataauditidx ON keyeddataaudit USING btree (category, key1, key2, key3, key4, parameter, version) NULLS NOT DISTINCT;

I want to delete values from the audit table if the following 2 requirements are satisfied:

  1. the updatedatetime is older than a certain cutoff date
  2. there is a row with identical keys and a higher version in keyeddata (or keyeddataaudit, I don't mind which)

The idea is to delete old values, but only if there is a more recent one.

I can get the same performance issues with either a select or a delete, so these examples are using a select.

If I run this query:

select count(*) from keyeddataaudit a, keyeddata t
  WHERE a.updatedatetime < '2025-10-01' AND
  (a.category = t.category) AND
  (a.key1 = t.key1 ) AND
  (a.key2 = t.key2 ) AND
  (a.key3 = t.key3 ) AND
  (a.key4 = t.key4 ) AND
  (a.parameter = t.parameter ) AND
  a.version < t.version;

Then it hits the index and completes in under a second.
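(How "it hits the index" can be checked: prefixing the statement with EXPLAIN (ANALYZE, BUFFERS) shows the chosen plan and which indexes are scanned, for example:

EXPLAIN (ANALYZE, BUFFERS)
select count(*) from keyeddataaudit a, keyeddata t
  WHERE a.updatedatetime < '2025-10-01' AND
  (a.category = t.category) AND
  (a.key1 = t.key1 ) AND
  (a.key2 = t.key2 ) AND
  (a.key3 = t.key3 ) AND
  (a.key4 = t.key4 ) AND
  (a.parameter = t.parameter ) AND
  a.version < t.version;
)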

However, that doesn't handle nulls.

If I change it to:

select count(*) from keyeddataaudit a, keyeddata t
  WHERE a.updatedatetime < '2025-10-01' AND
  (a.category is not distinct from t.category) AND
  (a.key1 is not distinct from t.key1 ) AND
  (a.key2 is not distinct from t.key2 ) AND
  (a.key3 is not distinct from t.key3 ) AND
  (a.key4 is not distinct from t.key4 ) AND
  (a.parameter is not distinct from t.parameter ) AND
  a.version < t.version;

Or if I try

select count(*) from keyeddataaudit a, keyeddata t
  WHERE a.updatedatetime < '2025-10-01' AND
  (a.category = t.category OR (a.category is null and t.category is null)) AND
  (a.key1 = t.key1 OR (a.key1 is null and t.key1 is null) ) AND
  (a.key2 = t.key2 OR (a.key2 is null and t.key2 is null) ) AND
  (a.key3 = t.key3 OR (a.key3 is null and t.key3 is null) ) AND
  (a.key4 = t.key4 OR (a.key4 is null and t.key4 is null) ) AND
  (a.parameter = t.parameter OR (a.parameter is null and t.parameter is null) ) AND
  a.version < t.version;

With both of these, the query ran for over 5 minutes without completing before I cancelled it.
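Since these variants never finish, the plan they would use can still be inspected without executing them by using plain EXPLAIN (no ANALYZE), for example:

EXPLAIN
select count(*) from keyeddataaudit a, keyeddata t
  WHERE a.updatedatetime < '2025-10-01' AND
  (a.category is not distinct from t.category) AND
  (a.key1 is not distinct from t.key1 ) AND
  (a.key2 is not distinct from t.key2 ) AND
  (a.key3 is not distinct from t.key3 ) AND
  (a.key4 is not distinct from t.key4 ) AND
  (a.parameter is not distinct from t.parameter ) AND
  a.version < t.version;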

How do I change either the indexes or the queries so that the indexes actually get used, please? Or alternatively, is there some other way to achieve this with reasonable performance?

Edit: This completes in 48 seconds:

 select count(*) from spreads.keyeddataaudit a, spreads.keyeddata t
  WHERE a.updatedatetime < '2025-10-01' AND
  (coalesce(a.category, 'null') = coalesce(t.category, 'null')) AND
  (coalesce(a.key1, 'null') = coalesce(t.key1, 'null')) AND
  (coalesce(a.key2, 'null') = coalesce(t.key2, 'null')) AND
  (coalesce(a.key3, 'null') = coalesce(t.key3, 'null')) AND
  (coalesce(a.key4, 'null') = coalesce(t.key4, 'null')) AND
  (coalesce(a.parameter, 'null') = coalesce(t.parameter, 'null')) AND
  a.version < t.version;

But this would get confused if the data ever contained the literal string 'null'.

That is solved by the following, which completes in 1m 29s:

 select * from spreads.keyeddataaudit a, spreads.keyeddata t
  WHERE a.updatedatetime < '2025-10-01' AND
  (coalesce(a.category, 'null') = coalesce(t.category, 'null') and coalesce(a.category, 'null2') = coalesce(t.category, 'null2')) AND
  (coalesce(a.key1, 'null') = coalesce(t.key1, 'null') and coalesce(a.key1, 'null2') = coalesce(t.key1, 'null2')) AND
  (coalesce(a.key2, 'null') = coalesce(t.key2, 'null') and coalesce(a.key2, 'null2') = coalesce(t.key2, 'null2')) AND
  (coalesce(a.key3, 'null') = coalesce(t.key3, 'null') and coalesce(a.key3, 'null2') = coalesce(t.key3, 'null2')) AND
  (coalesce(a.key4, 'null') = coalesce(t.key4, 'null') and coalesce(a.key4, 'null2') = coalesce(t.key4, 'null2')) AND
  (coalesce(a.parameter, 'null') = coalesce(t.parameter, 'null') and coalesce(a.parameter, 'null2') = coalesce(t.parameter, 'null2')) AND
  a.version < t.version;

But it really feels like there should be a better way!
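One untested avenue, sketched here as an assumption rather than a verified fix: because the planner matches index expressions literally, an expression index on keyeddata built from the same COALESCE expressions might let the lookup side use an index again. This assumes the single-sentinel variant above and that the literal string 'null' never appears in the data; the index name is made up:

CREATE INDEX keyeddata_coalesce_idx ON keyeddata (
  (coalesce(category, 'null')),
  (coalesce(key1, 'null')),
  (coalesce(key2, 'null')),
  (coalesce(key3, 'null')),
  (coalesce(key4, 'null')),
  (coalesce(parameter, 'null')),
  version
);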

2 Comments
  • I'm trying to understand the data model. It seems unexpected that the version column is not included in the index on keyeddata (category, key1, key2, key3, key4, parameter); in other words, that table can hold only one row per key combination, with a single version value, while keyeddataaudit can have many rows with different versions. So the tables can differ significantly in row count? And why are you using a cartesian product of the tables rather than a join? Commented Oct 28 at 23:16
  • @ValNik Yes, the audit table can have more rows than the main table, since it tracks changes to it. I was using that syntax to better map to what I will need to do for the delete, since you can't use a join in a delete (unless I'm missing something). Commented Oct 29 at 9:50
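For reference, PostgreSQL's DELETE does allow joining against another table via its USING clause, so the eventual delete can keep the same shape as the selects above. A minimal sketch, reusing the single-COALESCE conditions from the edit (not benchmarked here):

DELETE FROM keyeddataaudit a
  USING keyeddata t
  WHERE a.updatedatetime < '2025-10-01' AND
  (coalesce(a.category, 'null') = coalesce(t.category, 'null')) AND
  (coalesce(a.key1, 'null') = coalesce(t.key1, 'null')) AND
  (coalesce(a.key2, 'null') = coalesce(t.key2, 'null')) AND
  (coalesce(a.key3, 'null') = coalesce(t.key3, 'null')) AND
  (coalesce(a.key4, 'null') = coalesce(t.key4, 'null')) AND
  (coalesce(a.parameter, 'null') = coalesce(t.parameter, 'null')) AND
  a.version < t.version;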

1 Answer


I would say that there is no better way.

IS NOT DISTINCT FROM is not supported by indexes, and OR is usually a performance problem anyway. But the biggest problem with your statements is that they join two tables, and none of the join conditions compares using =. In such a case, the only join strategy left is a nested loop join, which tends to perform terribly with two bigger tables. The statements that use coalesce() join using =, so you can get a hash or a merge join, depending on the size of the tables. That is most likely the reason for the observed performance difference.
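As a side note on the hash-join path: if EXPLAIN (ANALYZE) shows the Hash node using more than one batch, the hash table did not fit in memory, and raising work_mem for the session can help. A hypothetical illustration; the value is an arbitrary example, not a recommendation:

-- session-level setting; pick a value appropriate to the available RAM
SET work_mem = '256MB';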

I'd say that the root of your problem is the NULL values. If you had defined the columns as NOT NULL, there wouldn't be a problem.
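To make that concrete, a sketch of such a migration, assuming an empty string is an acceptable stand-in for "no value" (the sentinel choice is an assumption, existing NULLs must be rewritten before the constraint can be added, and the same steps would be repeated for keyeddataaudit):

-- Rewrite existing NULLs to the sentinel first.
UPDATE keyeddata
   SET category  = coalesce(category, ''),
       key1      = coalesce(key1, ''),
       key2      = coalesce(key2, ''),
       key3      = coalesce(key3, ''),
       key4      = coalesce(key4, ''),
       parameter = coalesce(parameter, '')
 WHERE category IS NULL OR key1 IS NULL OR key2 IS NULL
    OR key3 IS NULL OR key4 IS NULL OR parameter IS NULL;

-- Then forbid NULLs going forward.
ALTER TABLE keyeddata
  ALTER COLUMN category SET NOT NULL,
  ALTER COLUMN key1 SET NOT NULL,
  ALTER COLUMN key2 SET NOT NULL,
  ALTER COLUMN key3 SET NOT NULL,
  ALTER COLUMN key4 SET NOT NULL,
  ALTER COLUMN parameter SET NOT NULL;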


4 Comments

Unfortunately, NULL values are expected in that table. It seems insane to me that Postgres allows NULL values but can't index them or use them in any useful way.
PostgreSQL can index NULL values. But the (standard-imposed) semantics of NULL are different from those of normal values. In other words: all relational databases are insane.
NULL means "unknown value", and you cannot compare two values that you don't know. Since you want to interpret NULL differently, as a special value, it makes sense that you have to use a COALESCE expression that maps both sides to the same special value. "But it really feels like there should be a better way!" Yes: don't use NULL, use a special value (an empty string? a special character?). Naming it a "key" is a sign that it should not be nullable; a primary key should have all its values known (i.e. NOT NULL) at insert time.
NULL also means empty, and this is a composite key whose parts can legitimately be empty. I'm now having to hack in special values such as empty strings in place of the NULLs to work around a ridiculous limitation.
