
I have a dilemma that would make for a somewhat curious experiment, and I was wondering if it has already been done.

I have a table:

create table test(action_date date, item_num int, <various irrelevant columns>,
                  primary key(action_date, item_num));

I have incoming data that contains the action_date, item_num, and the rest of the information for the <various irrelevant columns>. I need to update the data in the table quickly.

The fastest way I found was to:

  • delete the data first
  • insert the new data

The issue I am having is how to delete data in the fastest way possible:

The new data contains thousands of items and hundreds of dates, so I can do one of two things:

  1. I can do a single query: delete from test where (item_num = item_num1 and action_date >= action_date1) or (item_num = item_num2 and action_date >= action_date2) or ...

  2. Execute, as a single transaction, a set of queries: delete from test where item_num = item_num1 and action_date >= action_date1; delete from test where item_num = item_num2 and action_date >= action_date2; ...
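Spelled out with placeholder item numbers and dates (the actual values come from the incoming data), the two options look like this:

    -- Option 1: one DELETE with OR'ed predicates
    delete from test
    where (item_num = 17 and action_date >= '2021-01-01')
       or (item_num = 42 and action_date >= '2021-01-05');

    -- Option 2: one DELETE per item, wrapped in a single transaction
    begin;
    delete from test where item_num = 17 and action_date >= '2021-01-01';
    delete from test where item_num = 42 and action_date >= '2021-01-05';
    commit;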

The question is which way would be faster?

P.S. I could conduct experiments and see which way is faster, but I was wondering if this had already been done somewhere. My own searches yielded nothing.

  • Updating data in a table by deleting it and then re-inserting the new data isn't scalable and isn't going to be the most performant approach in the long run. If you index your table appropriately, then just updating will be most performant. Regardless, how you've indexed your table is also the answer to your question of the fastest way to delete from it. So please update your question to include your indexes. Commented Jan 14, 2021 at 15:43
  • Additionally, if your "irrelevant columns" don't matter, then why don't you normalize the table into two tables, where one table holds only the columns you care about for performant updates? Lastly, your second option is likely the better choice, because using OR in your predicates typically affects how indexes can be used, but again it mostly depends on what indexes you have. Commented Jan 14, 2021 at 15:45
  • @J.D. The irrelevant columns don't matter for the purposes of this discussion, but they do matter. As for updating data as it comes in: some data points already exist and some do not, so I have two options for upserting. Either use upsert, or delete + bulk copy. At the data sizes I am dealing with, delete + bulk copy works faster than upsert. Commented Jan 14, 2021 at 15:55
  • See Erwin's answer which reinforces my points about your indexes. If you're currently not indexing your table (my best guess since you haven't included that information) then you can't compare a valid test between upserting or not, since efficient updates and deletes are dependent on indexes. Commented Jan 14, 2021 at 15:58
  • @J.D.: The (included) PK provides the needed index, just in an inferior way, currently. Commented Jan 14, 2021 at 16:01

1 Answer


I don't expect a big difference between the two queries. Details depend on undisclosed information and data distribution.

What actually makes a big difference is having an index that supports the operation optimally. One might think you already have that with your PK:

primary key(action_date, item_num)

But that's not so. The order of index columns matters in this case. Instead, make it:

primary key(item_num, action_date)

Rule of thumb: equality first, range later.
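In PostgreSQL, swapping the column order means dropping and recreating the PK constraint, along these lines (a sketch; it assumes the default constraint name test_pkey and takes an exclusive lock on the table while it runs):

    begin;
    alter table test drop constraint test_pkey;
    alter table test add primary key (item_num, action_date);
    commit;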

You may additionally want to create an index on (action_date) or even (action_date, item_num) - or vice versa - for those other use cases you mentioned in a comment. But you must have the index on (item_num, action_date) one way or another for optimal performance.
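If those other use cases turn out to need it, the additional index could be created like this (the index name is arbitrary; CONCURRENTLY avoids blocking writes on a live table at the cost of a slower build):

    create index concurrently test_date_item_idx on test (action_date, item_num);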

As long as those other queries have equality predicates on both item_num and action_date, either index is equally good, and you don't need another.

Try to keep the number and size of your indexes to a minimum, as they themselves are a burden on write performance. Rule of thumb: as many as necessary, as few as possible.

Whether you DELETE and INSERT, or UPDATE (UPSERT), you produce a lot of dead tuples either way. So make sure to have appropriate autovacuum settings for the table. Default settings may not be aggressive enough if the table is big and the churn is high. The table needs to be vacuumed to be able to reuse space from dead tuples or do HOT updates. Otherwise, the table, and even more so the indexes, start to bloat and performance goes down.
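Autovacuum can be tuned per table with storage parameters; a sketch that makes it kick in after roughly 1 % of rows are dead, rather than the 20 % default (the exact thresholds are illustrative, not recommendations):

    alter table test set (
      autovacuum_vacuum_scale_factor = 0.01,
      autovacuum_analyze_scale_factor = 0.01
    );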

In extreme cases, vertical partitioning may pay off.
