
I have a table:

create table accounts_service.operation_history (
 history_id     bigint       generated always as identity primary key,
 operation_id   varchar(36)  not null unique,
 operation_type varchar(30)  not null,
 operation_time timestamptz  default now()  not null,
 from_phone     varchar(20),
 user_id        varchar(21),
-- and a lot of other varchar(x), text and even couple of number, boolean, jsonb, timestamp columns
);


create index operation_history_user_id_operation_time_idx
    on accounts_service.operation_history (user_id, operation_time);

create index operation_history_operation_time_idx
    on accounts_service.operation_history (operation_time);

I want to run a simple SELECT with a WHERE filter on operation_time (this filter is required and spans a day or two) plus additional filters on other columns, most commonly of varchar(x) type.

But my queries are slow:

explain (buffers, analyze)
select *
from operation_history operationh0_
where (null is null or operationh0_.user_id = null)
  and operationh0_.operation_time >= '2024-09-30 20:00:00.000000 +00:00'
  and operationh0_.operation_time <= '2024-10-02 20:00:00.000000 +00:00'
  and (operationh0_.from_phone = '+000111223344')
order by operationh0_.operation_time asc, operationh0_.history_id asc
limit 25;

Limit  (cost=8063.39..178328.00 rows=25 width=1267) (actual time=174373.106..174374.395 rows=0 loops=1)
  Buffers: shared hit=532597 read=1433916
  I/O Timings: read=517880.241
  ->  Incremental Sort  (cost=8063.39..198759.76 rows=28 width=1267) (actual time=174373.105..174374.394 rows=0 loops=1)
        Sort Key: operation_time, history_id
        Presorted Key: operation_time
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=532597 read=1433916
        I/O Timings: read=517880.241
        ->  Gather Merge  (cost=1000.60..198758.50 rows=28 width=1267) (actual time=174373.099..174374.388 rows=0 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              Buffers: shared hit=532597 read=1433916
              I/O Timings: read=517880.241
              ->  Parallel Index Scan using operation_history_operation_time_idx on operation_history operationh0_  (cost=0.57..197755.24 rows=12 width=1267) (actual time=174362.932..174362.933 rows=0 loops=3)
                    Index Cond: ((operation_time >= '2024-09-30 20:00:00+00'::timestamp with time zone) AND (operation_time <= '2024-10-02 20:00:00+00'::timestamp with time zone))
                    Filter: ((from_phone)::text = '+000111223344'::text)
                    Rows Removed by Filter: 723711
                    Buffers: shared hit=532597 read=1433916
                    I/O Timings: read=517880.241
Planning Time: 0.193 ms
Execution Time: 174374.449 ms

For simplicity:

set max_parallel_workers_per_gather = 0;

This only simplifies the plan; the numbers are still representative. Retrying the previous query:

Limit  (cost=7535.40..189179.35 rows=25 width=1267) (actual time=261432.728..261432.729 rows=0 loops=1)
  Buffers: shared hit=374346 read=1591362
  I/O Timings: read=257253.065
  ->  Incremental Sort  (cost=7535.40..210976.63 rows=28 width=1267) (actual time=261432.727..261432.727 rows=0 loops=1)
        Sort Key: operation_time, history_id
        Presorted Key: operation_time
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=374346 read=1591362
        I/O Timings: read=257253.065
        ->  Index Scan using operation_history_operation_time_idx on operation_history operationh0_  (cost=0.57..210975.37 rows=28 width=1267) (actual time=261432.720..261432.720 rows=0 loops=1)
              Index Cond: ((operation_time >= '2024-09-30 20:00:00+00'::timestamp with time zone) AND (operation_time <= '2024-10-02 20:00:00+00'::timestamp with time zone))
              Filter: ((from_phone)::text = '+000111223344'::text)
              Rows Removed by Filter: 2171134
              Buffers: shared hit=374346 read=1591362
              I/O Timings: read=257253.065
Planning Time: 0.170 ms
Execution Time: 261432.774 ms

So it filtered out just 2,171,134 rows, and it took more than 4 minutes. That seems too long, doesn't it?

  • I tried selecting specific columns (e.g. operation_time, from_phone, to_phone, history_id); it had no effect.
  • I tried VACUUM ANALYZE; it had no effect.
  • I checked some Postgres parameters, like shared_buffers, work_mem, etc. Changing them had no effect.
  • I compared the configuration with pgTune and it looks OK.

Some additional info:

SELECT relpages, pg_size_pretty(pg_total_relation_size(oid)) AS table_size
FROM pg_class
WHERE relname = 'operation_history';

 relpages | table_size
----------+------------
 18402644 | 210 GB

select count(*) from operation_history;

   count
-----------
 352402877

Server drives: AWS gp3
I don't want to create indexes for all columns because there are massive writes to this table...

Is there any way to optimize this?
Or is it just doing a lot of reads from the index and the table, which is expected, and we need to move to sharding, etc.?

UPD: I checked index bloat; the result:

idxname    | operation_history_operation_time_idx
real_size  | 7839301632
extra_size | 746373120
extra_pct  | 9.520913405772113
fillfactor | 90
bloat_size | 0
bloat_pct  | -0.6147682314362566
is_na      | false

I checked table bloat; the result:

tblname    | operation_history
real_size  | 150754459648
extra_size | 7987224576
extra_pct  | 5.298168024116535
fillfactor | 100
bloat_size | 7987224576
bloat_pct  | 5.298168024116535
is_na      | false

And it seems to be ok.

  • Is this a copy-paste mistake or do you really have this part in your query: where (null is null or operationh0_.user_id = null)? This looks quite incorrect and I don't get what this condition is supposed to do. Did you rather want operationh0_.user_id IS NULL? Commented Oct 26, 2024 at 9:16
  • (null is null or x) looks like that's an optional condition which wasn't used in the app the request originated from - it's ugly but it doesn't matter because PostgreSQL optimises that away. Production env logs are typically full of queries with a bunch of 1=1, x=coalesce(null,x) and all sorts of and true resulting from params being left unused and defaulting to something neutral and reducible. Commented Oct 26, 2024 at 9:37
  • @Zegarek: But operationh0_.user_id = null is logical nonsense nonetheless. Commented Oct 26, 2024 at 9:39
  • "I don't want to create indexes for all columns". As some great philosophers once said, you can't always get what you want. Have you tried creating the indexes? Were the consequences worse than the consequences of not having them? Commented Oct 26, 2024 at 16:14
  • @Zegarek you're right, it's the optional condition. Commented Oct 27, 2024 at 9:43

3 Answers


I/O

I compute these I/O performance indicators from your (non-parallel) query plan:

SELECT 1591362::bigint * 1000 / 257253.065;                         -- 6186 IOPS
SELECT pg_size_pretty(1591362::bigint * 8192 * 1000 / 257253.065);  -- 48 MB/s I/O throughput

Amazon advertises their "gp3" storage with at least:

3,000 IOPS free
125 MB/s free

And:

Max IOPS/Volume: 16,000
Max Throughput/Volume: 1,000 MB/s
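For perspective, a back-of-the-envelope calculation of mine: the plan reads roughly 12 GB in total, which would take about 12 seconds at the advertised maximum throughput, versus the ~257 seconds of read time observed:

SELECT pg_size_pretty(1591362::bigint * 8192);  -- 12 GB total read volume
-- at 1,000 MB/s this would take ~12 s; the plan reports ~257 s of read I/O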

You commented:

Correlation for operation_time is 0.9960051
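For reference, that figure can be read from the standard pg_stats view:

SELECT correlation
FROM   pg_stats
WHERE  schemaname = 'accounts_service'
AND    tablename  = 'operation_history'
AND    attname    = 'operation_time';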

When traversing the index operation_history_operation_time_idx, reads from both index and heap should be mostly sequential. (Unless their storage system works in unexpected ways.)

So, IOPS seems ok, but the I/O throughput seems bad. Both could be better.
What's your exact contract with Amazon?

You can pay up for better I/O with "gp3".
For more money, yet, they offer even better I/O performance:

For applications that need higher durability, latency, or IOPS than gp3 can provide, we recommend using io2 Block Express volumes.

(As do a range of other cloud providers.)

Indexing

For the query at hand: it fetches and then discards all 2,171,134 rows that qualify on operation_time alone. What it really needs is an index on (from_phone), ideally in its combined form: (from_phone, operation_time), fields in this order.

That would be a Get Out of Jail Free card for this query; see the sketch below. But you mentioned variations: the more fields can be filtered on, the harder it gets to cover everything with indexes.
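A minimal sketch of that index (the name is illustrative):

create index operation_history_from_phone_operation_time_idx
    on accounts_service.operation_history (from_phone, operation_time);

With from_phone pinned by equality, the scan jumps straight to the matching phone number and reads rows already ordered by operation_time.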

The ...

filter on operation_time [...] can be a day or two

This might be a promising angle of attack: partition the table on operation_time with a 1-day partition size, especially if your queries focus on recent days. Then Postgres can just read whole partitions sequentially without an index, or you can create indexes per partition and drop them for older partitions. Per-partition indexes will be orders of magnitude smaller than whole-table indexes and easier to cache.
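A minimal partitioning sketch, assuming a newly created table (names are illustrative; note that a primary key on a partitioned table must include the partition key, and identity columns on partitioned tables are only fully supported in recent Postgres versions):

create table accounts_service.operation_history_part (
    history_id     bigint       not null,  -- fill from a sequence; see note above
    operation_id   varchar(36)  not null,  -- a unique constraint would also need operation_time
    operation_type varchar(30)  not null,
    operation_time timestamptz  default now() not null,
    from_phone     varchar(20),
    user_id        varchar(21),
    -- ... remaining columns ...
    primary key (history_id, operation_time)
) partition by range (operation_time);

create table accounts_service.operation_history_2024_10_01
    partition of accounts_service.operation_history_part
    for values from ('2024-10-01') to ('2024-10-02');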

Also, if you don't actually need SELECT * in your query, there may be options for index-only scans ...

Table design

Your table design is wasteful, which is bad for storage and performance every step of the way. Example:

operation_id varchar(36)

I smell a UUID in ugly disguise. A native uuid column stores the same value in 16 bytes instead of a 36-character string.
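A sketch of the conversion, assuming the values really are UUIDs (caution: this rewrites the table under an exclusive lock):

alter table accounts_service.operation_history
    alter column operation_id type uuid using operation_id::uuid;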

Server / Maintenance

Upgrading from the aging Postgres version 13 (EOL next year) should help.
And more RAM, faster I/O, obviously.
And/or more CPUs, coupled with a higher setting for max_parallel_workers_per_gather.
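For example (the value is illustrative; effective parallelism is also capped by max_worker_processes and max_parallel_workers):

set max_parallel_workers_per_gather = 4;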

If the total relation size of 210 GB is much bigger than expected, tables and indexes might be bloated. (Investigate to verify!) The brute-force, built-in tool to fix that, taking a long exclusive lock on the table:

VACUUM FULL ANALYZE accounts_service.operation_history;

There are smarter alternatives, such as pg_repack, which rebuilds tables without the long exclusive lock.

After a rough calculation: probably not bloated. What I see here already accounts for at least 110 GB in pristine condition, and you mentioned more columns. So that's not it.

Asides

Concerning correctness, barely performance:

AND o.operation_time >= '2024-09-30 20:00:00.000000 +00:00'
AND o.operation_time <= '2024-10-02 20:00:00.000000 +00:00'

It rarely makes sense to include both the lower and the upper bound in such a scenario. It should probably be:

AND o.operation_time >= '2024-09-30 20:00:00.000000 +00:00'
AND o.operation_time <  '2024-10-02 20:00:00.000000 +00:00'

Comments

  • "The I/O performance you get from AWS is abysmally bad" - that depends on what you pay for. AWS, or any other cloud provider I know of, doesn't have a fixed, standard I/O performance. When you need 1,000 IOPS, you pay for 1,000 IOPS and you get 1,000 IOPS; it's just perfect. For AWS RDS you can order anything between 1,000 and 256,000 IOPS. It's up to you.
  • @FrankHeikens I didn't mean in general, but for the query at hand. The specifications for "gp3" linked in the question claim "max. 1,000 MB/s" throughput, but I compute less than 50 from the query plan ...
  • Thanks for the answer! I had already checked the size of the index and table, and it looks good (as do your calculations). I updated the question with the results. I also checked without including the lower bound, and it became faster by just 1 second :( Could you please describe how you calculated the throughput from the query plan?
  • @jjanes: I replaced my rant about I/O throughput with more substance on the numbers. (Also showing my calculation for Alexey.) Added some more while being at it.
  • The max throughput depends on what you ordered. In my experience, they give predictable performance. In addition, you might receive some extras in the form of performance bursts.

It's possible you can create a multicolumn B-tree ("indexed sequential") index that will support your query more efficiently. Try this one:

create index whatever_name
    on accounts_service.operation_history
      (from_phone, user_id, operation_time, history_id);

This index may be astonishingly fast because:

  1. Your from_phone = 'whatever' filter is probably very selective, ruling out almost all the rows of the table.
  2. Your user_id IS NULL filter isn't as selective, but it is still a kind of equality filter.
  3. You have a range filter on operation_time, and it is also one of your ordering columns. Your other ordering column is history_id.

If my guess about this index is correct, your filter will be satisfied by random-accessing the index to the first eligible row, then scanning it sequentially for your limit of 25 rows (or until the first ineligible row).

And, if you include this clause in the index

   include (operation_type)

it will be a so-called covering index. That is, the query planner can satisfy your query from the index alone, without needing to visit the table's heap. That should make it very fast indeed.
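Spelled out as a complete statement (a sketch; the index name is a placeholder, as above):

create index whatever_name
    on accounts_service.operation_history
      (from_phone, user_id, operation_time, history_id)
    include (operation_type);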

Pro tip: When you have query performance problems, it's wise to avoid SELECT * and instead name the columns you need. Leaving out unneeded columns may allow the query planner to avoid expensive data slinging.

Comments

  • You can see in the plan that the user_id is null condition gets discarded: and (null is null or user_id is null) becomes and (true or user_id is null), then a neutral and true that's just thrown away. That being said, including it might help other queries where the params behind those nulls get used. What you suggest won't be a covering index - the plan shows row widths above 1200, and the OP stated there are way more columns beyond these few. Still, even rows this wide should fit under the include payload size limit. The index will swell up to about the size of the table, though.
  • Oh yeah, you're right, I missed that. That means trying create index whatever_name on accounts_service.operation_history (from_phone, operation_time, history_id); and omitting the user_id column from the multicolumn index.

So it filtered out just 2,171,134 rows, and it took more than 4 minutes. That seems too long, doesn't it?

In order to filter out that many rows, you first have to fetch that many rows, and fetching them is where the time is going. So you need to get it to remove the mismatching rows without fetching them from the table. A way to do this is with a multicolumn index, for example:

create index on operation_history (operation_time, colA, colB, colC, colD, from_phone);

Once from_phone is in the index, it can filter out the mismatched-phone rows based on the index alone, without hitting the table. This index should be useful for simple filters on any of the non-leading columns in the index (provided you always have a range criterion on operation_time, which you indicated you do).

It would be more efficient to have an index on (from_phone, operation_time), but then you would need several other analogous indexes, which you apparently don't want. By putting operation_time first, you give up some absolute efficiency but gain the flexibility to add several more columns without needing several more indexes.

You would not want to include particularly large columns in the index, like jsonb columns are likely to be: those are unlikely to actually help with filtering, and they make the index gratuitously large. You also shouldn't include columns that are not used for filtering but are simply there to be displayed once the rows are selected based on values in other columns.

0.996 seems like a very high correlation, but based on some data I simulated, the remaining 0.004 can hide a surprising amount of disorder, which in turn drives a lot of random I/O. The amount of disorder I find is not nearly as much as what you seem to have, but it is a lot more than I expected. I didn't simulate 210 GB of data, as I have neither the hard drive nor the patience for that; I did it at 1/100 scale. I don't know if the remaining discrepancy is due to the reduced size of my data, or if your data has some awkward regularity to it that I can't simulate.

I was initially thinking that partitioning the table on operation_time would help by forcing more clustering on that column. However, your data is already highly clustered, just not highly enough, so I don't think partitioning would do anything for this query. Using the CLUSTER command to cluster the table perfectly (not just 0.996) on operation_time would almost certainly help, but the table likely wouldn't stay clustered as more data is added.
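For completeness, a sketch of that command (caution: CLUSTER takes an exclusive lock and rewrites the whole table, so it needs a maintenance window):

cluster accounts_service.operation_history
    using operation_history_operation_time_idx;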

