
I have the following table 'medicion' with the following fields:

id_variable[int](PK), 
id_departamento[int](PK), 
fecha [date](PK), 
valor [number].

So, I want to get the minimum, maximum, and average of valor, grouping all of that data by id_variable. My query is:

SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;

Knowing that by default PostgreSQL builds an index for the primary key

(id_departamento, id_variable, fecha)

how can I optimize this query? Should I create a new index on id_variable only, or is the default primary-key index enough for this query?
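
If it helps, the plan PostgreSQL actually chooses can be inspected with EXPLAIN (output omitted here):

EXPLAIN (ANALYZE, BUFFERS)
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;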

Thanks!

  • Do you have id_valor or valor? Both, or is there a typo? Commented Aug 3, 2017 at 19:47
  • It's a typo, sorry Commented Aug 3, 2017 at 19:54
  • Add a multicolumn covering index on id_variable, valor. PostgreSQL will scan the index instead of the table. It must scan the whole index (or table) because the AVG function is used; AVG always has to scan all rows to calculate the average. (See the sketch below the comments.) Commented Aug 3, 2017 at 19:54
  • So what's the advantage if I create the index? I mean, PostgreSQL will scan the whole table/index anyway. Commented Aug 3, 2017 at 19:58
  • @krokodilko I'm with Tomi here. Without a WHERE the db will do a full scan, so the index won't help. Now if you add WHERE id_variable = <something>, that covering index will help. Commented Aug 3, 2017 at 20:14
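
A minimal sketch of the covering index suggested in the comments (the index name is just illustrative):

CREATE INDEX medicion_variable_valor_idx
    ON medicion (id_variable, valor);

With such an index, PostgreSQL can satisfy the query with an index-only scan, reading id_variable and valor straight from the index instead of the heap (provided the visibility map is reasonably up to date, e.g. after a VACUUM).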

1 Answer


Since there is an avg(), and one needs all the values to compute an average, the query is going to read the whole table. It could be restricted with a WHERE, but there is none, so I presume you want global statistics.

The only things an extra covering index brings are:

  • Not reading the entire table.

This could be beneficial if there were, say, 50 columns, or TEXT columns that make the table file huge. In that case, reading the whole table just to average a few ints would grind through tons of useless data on disk.

Covering indexes are awesome when you want to snipe one or two columns out of a huge table and keep that small column set in cache. But that is not the case here: you only have small columns, so this reason is out.

  • ...and of course slightly slower UPDATEs, since the index needs to be maintained. Also, the index needs to be cached, so it's going to use some RAM, etc.

  • Getting the rows pre-sorted for convenient aggregation.

This can matter here, mostly if it avoids a huge sort. However, if all it avoids is a hash aggregate, which is super fast anyway, it's not so useful.

Now, if you have relatively few distinct values of id_variable... say, few enough to fit into a hash aggregate, which can be a sizable amount depending on your work_mem... then it will be difficult to beat.
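
If you want to check how much room the hash aggregate has, and give it more for the current session, a small sketch (the 64MB value is just an example, not a recommendation):

SHOW work_mem;           -- current per-sort / per-hash memory budget
SET work_mem = '64MB';   -- affects the current session only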

If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (or a summary table that keeps min/max/avg for each id_variable and is updated on each insert). Refreshing the materialized view takes time, so this is a tradeoff that pays off when you read the stats much more often than the data changes.
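
A minimal sketch of the materialized-view variant (the view name is just illustrative):

CREATE MATERIALIZED VIEW medicion_stats AS
SELECT id_variable,
       MIN(valor) AS min_valor,
       MAX(valor) AS max_valor,
       AVG(valor) AS avg_valor
FROM medicion
GROUP BY id_variable;

-- after loading new data:
REFRESH MATERIALIZED VIEW medicion_stats;

Readers then query medicion_stats directly; the aggregation cost is paid at refresh time instead of at query time.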

You could keep your stats in cache if you don't mind them being stale.

Or, if your table has tons of old data, you could partition it, keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new rows, combining the two at query time (sketched below).
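
A hedged sketch of that combination, assuming a hypothetical summary table medicion_old_stats(id_variable, min_valor, max_valor, sum_valor, cnt) holding the pre-computed figures for the frozen data, and an arbitrary example cutoff date:

SELECT n.id_variable,
       LEAST(o.min_valor, n.min_valor)               AS min_valor,
       GREATEST(o.max_valor, n.max_valor)            AS max_valor,
       (o.sum_valor + n.sum_valor) / (o.cnt + n.cnt) AS avg_valor
FROM (
    SELECT id_variable,
           MIN(valor) AS min_valor,
           MAX(valor) AS max_valor,
           SUM(valor) AS sum_valor,
           COUNT(*)   AS cnt
    FROM medicion
    WHERE fecha >= DATE '2017-01-01'   -- only the recent, still-changing rows
    GROUP BY id_variable
) n
JOIN medicion_old_stats o USING (id_variable);
-- assumes every id_variable has both old and new rows; use a FULL JOIN plus COALESCE otherwise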
