I'm learning about PostgreSQL's clustering capabilities, and I would like to compare the performance of the same query against a table before and after it is clustered.

I generated 25 million user events and ran the query before and after clustering. However, EXPLAIN ANALYSE doesn't give consistent timings from run to run in either case, so the values are hard to compare. Before clustering the query takes ~100-200ms, and after clustering it takes about the same, though I see Heap Fetches: 0 in that case.

My question is: how do I benchmark the query before and after clustering so that I can analyze the difference? Are there any tools that allow this? Maybe I can collect stats from multiple query runs and visualize the percentiles in each case?

I have seen that it's possible to collect the sum of the values and compare those, but isn't it possible to get percentiles somehow? Maybe you use some data visualization tools for that?

  • That depends on your table and index definitions and on what query you're trying to optimise. Attach the DDL, your query and the full plan. Depending on where you saw Heap Fetches: 0, it could mean your test ended up being fine with an index-only scan, for example. In that case it wouldn't matter whether your table is freshly clustered or a complete unordered mess full of unvacuumed tuples, because it's not being scanned at all. Everything came from the index, which is a separate object, the entire point of which is to keep things ordered inside it. Commented Jun 18 at 19:27
  • EXPLAIN ANALYZE is the right tool, but not all queries will benefit from their target table being CLUSTERed. If yours can get everything it needs from the index, that's actually better. If you're certain CLUSTER should help, make sure the table is ANALYZEd after clustering, before you start your tests. Once your measurements start making sense, move on to pgbench. Commented Jun 18 at 19:39
  • Why do you want to use CLUSTER in the first place? What is the real problem you are trying to solve? Your best friend is explain(analyze, verbose, buffers, settings). And you have to share the results if you need help with it. Commented Jun 18 at 19:52

1 Answer

To answer your question as it is asked: you benchmark your queries by running them. To get dependable values, you have to run them repeatedly, ideally with different constants. Running the query with different constants reduces the effects of caching (the statement becomes faster and faster, because more and more of its data are cached in RAM). One tool that you can use for that purpose is the built-in pgbench with a custom script.
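As for the percentiles asked about in the question, here is a minimal sketch of how they could be computed from repeated runs. It assumes pgbench was run with a custom script (`-f`) and per-transaction logging (`-l`), which writes one line per transaction with the elapsed time in microseconds as the third field. The function names and the sample lines are made up for illustration; they are not part of pgbench and not real measurements.

```python
import math


def parse_latencies_ms(log_lines):
    """Extract per-transaction latency in milliseconds from pgbench -l log lines.

    Assumed line layout (default per-transaction log):
        client_id transaction_no time script_no time_epoch time_us
    where `time` is the elapsed transaction time in microseconds.
    """
    latencies = []
    for line in log_lines:
        fields = line.split()
        # Skip blank or truncated lines, and lines whose time field is not
        # a number (pgbench can log e.g. "skipped" there in some modes).
        if len(fields) < 6 or not fields[2].isdigit():
            continue
        latencies.append(int(fields[2]) / 1000.0)
    return latencies


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(log_lines, percentiles=(50, 90, 95, 99)):
    """Return {percentile: latency_ms} for the given log lines."""
    latencies = parse_latencies_ms(log_lines)
    return {p: percentile(latencies, p) for p in percentiles}


# Hypothetical sample lines for illustration only.
sample = [
    "0 1 100000 0 1718740000 100000",
    "0 2 120000 0 1718740000 220000",
    "0 3 150000 0 1718740000 370000",
    "0 4 200000 0 1718740000 570000",
]
print(summarize(sample))
```

One way to use it: run something like `pgbench -n -f query.sql -l -T 60 dbname` once before and once after CLUSTER (`query.sql` and `dbname` are placeholders), feed each resulting log file to `summarize`, and compare the two dictionaries. The resulting numbers can then go into whatever plotting tool you prefer.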

To answer the unasked question that I suspect is your actual problem: Your query is performing an index-only scan, and your attempts to improve the performance by running CLUSTER on the underlying table failed. That is hardly surprising, since an index-only scan that doesn't perform any heap fetches is independent of the physical order of the heap table, because the heap table is not even accessed. CLUSTER rewrites the table, but the index will look pretty much the same and won't perform any differently.
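To illustrate why physical order only matters when the heap is actually read, here is a toy model. It is not PostgreSQL internals, only a sketch under the assumption of fixed-size pages holding 100 rows each. It counts how many distinct heap pages hold the rows matching a key range, once with rows stored in key order (as after CLUSTER) and once with rows stored in random order. All names and numbers are hypothetical.

```python
import random


def pages_touched(keys_in_heap_order, lo, hi, rows_per_page=100):
    """Count distinct heap pages that hold rows with lo <= key < hi.

    `keys_in_heap_order` lists the key of each row in physical storage
    order; a row at position i is assumed to live on page i // rows_per_page.
    """
    pages = set()
    for position, key in enumerate(keys_in_heap_order):
        if lo <= key < hi:
            pages.add(position // rows_per_page)
    return len(pages)


keys = list(range(100_000))

clustered = keys                      # physical order matches key order
shuffled = keys[:]                    # same rows, random physical order
random.Random(0).shuffle(shuffled)    # fixed seed so the run is repeatable

# A range query for 1,000 consecutive keys.
print("clustered:", pages_touched(clustered, 5000, 6000))
print("shuffled: ", pages_touched(shuffled, 5000, 6000))
```

In this model the clustered layout needs only the 10 pages that hold the contiguous range, while the shuffled layout spreads the same 1,000 rows over many more pages. A scan that reads nothing from the heap, like the index-only scan described above, skips this step entirely, so the layout makes no difference to it.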

1 Comment

Thank you for pointing out the caching effect. One more thing I hadn't realized was that my data before clustering was already in the same order as if it had been clustered (because of how the sample was generated). After fixing this, I actually got to the point where the clustered table gives better performance.
