
I have the following table 'medicion' with the following fields:

id_variable[int](PK), 
id_departamento[int](PK), 
fecha [date](PK), 
valor [number].

So, I want to get the minimum, maximum, and average of valor, grouping all of that data by id_variable. My query is:

SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;

Knowing that by default PostgreSQL builds an index for the primary key

(id_departamento, id_variable, fecha)

how can I optimize this query? Should I create a new index on id_variable only, or is the default primary-key index enough for this query?
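
If it helps, the plan PostgreSQL actually chooses can be inspected with EXPLAIN (output omitted here):

EXPLAIN (ANALYZE, BUFFERS)
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;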

Thanks!

  • Do you have id_valor or valor? Both, or is there a typo? Commented Aug 3, 2017 at 19:47
  • It's a typo, sorry Commented Aug 3, 2017 at 19:54
  • Add a multicolumn covering index on id_variable, valor. PostgreSQL will scan the index instead of the table. It must scan the whole index (or table) because the AVG function is used; AVG always has to scan all rows to calculate the average. (See the sketch below the comments.) Commented Aug 3, 2017 at 19:54
  • So what's the advantage if I create the index? I mean, PostgreSQL will scan the whole table/index anyway. Commented Aug 3, 2017 at 19:58
  • @krokodilko I'm with Tomi here. Without a WHERE the db will do a full scan, so the index won't help. Now if you add WHERE id_variable = <something>, that covering index will help. Commented Aug 3, 2017 at 20:14
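
A minimal sketch of the covering index suggested in the comments (the index name is just illustrative):

CREATE INDEX medicion_variable_valor_idx
    ON medicion (id_variable, valor);

With such an index, PostgreSQL can satisfy the query with an index-only scan, reading id_variable and valor straight from the index instead of the heap (provided the visibility map is reasonably up to date, e.g. after a VACUUM).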

1 Answer


Since there is an avg(), and one needs all the values to compute an average, the query is going to read the whole table. It could be restricted with a WHERE, but there is none, so I presume you want global statistics.

The only things an extra covering index brings are:

  • Not reading the entire table.

This could be beneficial if there were, say, 50 columns, or TEXT columns that make the table file huge. In that case, reading the whole table just to average a few ints would grind through tons of useless data on disk.

Covering indexes are awesome when you want to snipe one or two columns out of a huge table and keep that small column set in cache. But that is not the case here: you only have small columns, so this reason is out.

  • ...and of course slightly slower UPDATEs, since the index needs to be maintained. Also, the index needs to be cached, so it's going to use some RAM, etc.

  • Getting the rows pre-sorted for convenient aggregation.

This can matter here, mostly if it avoids a huge sort. However, if all it avoids is a hash aggregate, which is super fast anyway, it's not so useful.

Now, if you have relatively few distinct values of id_variable... say, few enough to fit into a hash aggregate, which can be a sizable amount depending on your work_mem... then it will be difficult to beat.
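
If you want to check how much room the hash aggregate has, and give it more for the current session, a small sketch (the 64MB value is just an example, not a recommendation):

SHOW work_mem;           -- current per-sort / per-hash memory budget
SET work_mem = '64MB';   -- affects the current session only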

If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (or a summary table that keeps min/max/avg for each id_variable and is updated on each insert). Refreshing the materialized view takes time, so this is a tradeoff that pays off when you read the stats much more often than the data changes.
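
A minimal sketch of the materialized-view variant (the view name is just illustrative):

CREATE MATERIALIZED VIEW medicion_stats AS
SELECT id_variable,
       MIN(valor) AS min_valor,
       MAX(valor) AS max_valor,
       AVG(valor) AS avg_valor
FROM medicion
GROUP BY id_variable;

-- after loading new data:
REFRESH MATERIALIZED VIEW medicion_stats;

Readers then query medicion_stats directly; the aggregation cost is paid at refresh time instead of at query time.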

You could keep your stats in cache if you don't mind them being stale.

Or, if your table has tons of old data, you could partition it, keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new rows, combining the two at query time (sketched below).
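
A hedged sketch of that combination, assuming a hypothetical summary table medicion_old_stats(id_variable, min_valor, max_valor, sum_valor, cnt) holding the pre-computed figures for the frozen data, and an arbitrary example cutoff date:

SELECT n.id_variable,
       LEAST(o.min_valor, n.min_valor)               AS min_valor,
       GREATEST(o.max_valor, n.max_valor)            AS max_valor,
       (o.sum_valor + n.sum_valor) / (o.cnt + n.cnt) AS avg_valor
FROM (
    SELECT id_variable,
           MIN(valor) AS min_valor,
           MAX(valor) AS max_valor,
           SUM(valor) AS sum_valor,
           COUNT(*)   AS cnt
    FROM medicion
    WHERE fecha >= DATE '2017-01-01'   -- only the recent, still-changing rows
    GROUP BY id_variable
) n
JOIN medicion_old_stats o USING (id_variable);
-- assumes every id_variable has both old and new rows; use a FULL JOIN plus COALESCE otherwise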
