"distinct on" with group by postgres

Question

I have the following records:

id  run_hour               performance_hour      value
2  "2017-06-25 09:00:00"  "2017-06-25 07:00:00"    6
2  "2017-06-25 09:00:00"  "2017-06-25 08:00:00"    5
1  "2017-06-25 09:00:00"  "2017-06-25 08:00:00"    5
2  "2017-06-25 08:00:00"  "2017-06-25 07:00:00"    5
1  "2017-06-25 08:00:00"  "2017-06-25 07:00:00"    5

We run every hour to look at the result of each id for current hour and previous hours.

Only if there was a change from previous hour run we insert a new reocrd (we don't want to overwrite the value because we want to measure the value if looked after 1 hour, or 2 hours etc.

I would like to sum for each id in the latest available value (sort by run_hour) - the values.

In the above example, ad 1 for run 9:00 and performance hour of 7:00 doesn't have a record - because it was the same as the run of 8:00 and performance hour of 7:00

In the above example, if I ask for sum of values of run 2017-06-25 09:00:00 I would expect to get:

id, value
1   10
2   11

for id 1, it's 10 calculated: (run_hour<2017-06-25 08:00:00> + run_hour<2017-06-25 09:00:00>) and for id 2, it's 11 calculated: (run_hour<2017-06-25 09:00:00> + run_hour<2017-06-25 09:00:00>) I wrote the following query:

select distinct on (id, run_hour) id, sum(value) from metrics where  run_hour <= '2017-06-25 09:00' and performance_hour >= '2017-06-25 07:00' and  performance_hour < '2017-06-25 09:00'
group by id
order by id, run_hour

However I get an error that also run_hour has to be in the GROUP BY clause. - but if I add it I get incorrect data - also data for previous hours which I don't need - I need the latest hour that had data.

How can I use "distinct on" with group by?

What makes you decide for the record with value 6 for id=2 , run_hour= "2017-06-25 09:00:00"? Because its earlier performance_hour? — Thorsten Kettner
– Thorsten Kettner, Commented Jun 28, 2017 at 14:28
It's data that I get every hour about performance of our ads. when I ran at 9:00 for hour 7 - I got value 6, which is different that the previous run for that hour (8:00 for hour 7 and ad 2) so I insert a new record — Dejell
– Dejell, Commented Jun 28, 2017 at 14:29
No, I don't understand. You have two records for id=2 , run_hour= "2017-06-25 09:00:00". You take the one with value 6 and dismiss the one with value 5 when building the sum. Why that record and not the other one? — Thorsten Kettner
– Thorsten Kettner, Commented Jun 28, 2017 at 14:31
the performance hour is different. for id at 9:00 for performance between 7-9 the sum is 11, however, for id 1, there is no record for run hour 9:00 and performance record 8:00, because I retrieved the same data from the external source so I didn't want to create a redundant record (but since the value is the same - I need to use the record from run 8:00 to performance 7:00 for id 1) — Dejell
– Dejell, Commented Jun 28, 2017 at 14:33
Okay, I think I got it now. Complicated. I'll have to think of a solution... — Thorsten Kettner
– Thorsten Kettner, Commented Jun 28, 2017 at 14:37

Thorsten Kettner · Accepted Answer · 2017-06-28 19:41:24Z

2

The task is very complicated. Let's say you want the performance hours 7:00 till 9:00 from the following data:

id  run_hour               performance_hour      value
2   "2017-06-25 09:00:00"  "2017-06-25 06:00:00"    6
2   "2017-06-25 09:00:00"  "2017-06-25 10:00:00"    5

The expected result would be 18 (6 for 7:00 + 6 for 8:00 + 6 for 9:00) all based on the 6:00 record which itself is outside the desired time range.

We need a recursive CTE starting from the first wanted performance hour per id till the last wanted one. Thus we build records that don't exist and that we can sum up later.

with recursive cte(id, run_hour, performance_hour, value) as
(
  select *
  from
  (
    select distinct on (id) 
      id, 
      run_hour,
      greatest(performance_hour, timestamp '2017-06-25 07:00') as performance_hour, 
      value
    from metrics
    where run_hour = timestamp '2017-06-25 09:00' 
      and performance_hour <= timestamp '2017-06-25 07:00'
    order by id, metrics.performance_hour desc
  ) start_by_id
  union all
  select 
    cte.id, 
    cte.run_hour,
    cte.performance_hour + interval '1 hour' as performance_hour,
    coalesce(m.value, cte.value) as value
  from cte
  left join metrics m on m.id = cte.id
                      and m.run_hour = cte.run_hour
                      and m.performance_hour = cte.performance_hour + interval '1 hour'
  where cte.performance_hour < timestamp '2017-06-25 09:00'
)
select id, sum(value)
from cte
group by id;

Rextester link: http://rextester.com/PHC88770

edited Jun 28, 2017 at 19:41

answered Jun 28, 2017 at 15:28

Thorsten Kettner

96.6k8 gold badges56 silver badges82 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Dejell Over a year ago

why the expected result will be 18 and not 11 for id: 2? I need the most recent run_time record if exists

Thorsten Kettner Over a year ago

Okay, I missed the part that you want the time till current hour. Yes, this would include the second record, of course. The question however remains how many previous hours. If you look at 7:00 to 10:00 you'll get a sum of 23. If you look at 9:00 to 10:00 you get a sum of 11. If you look at 9:00 to 12:00 you get 21. Just adjust the timestamps in the query so as to get the range you want.

Thorsten Kettner Over a year ago

I had a bug in my query, though. In the DISTINCT ON query I ordered by id, performance_hour instead of id, metrics.performance_hour. That could pick the wrong record due to ambiguity. Well, that comes from using an alias name that is also a column name :-| I've updated my answer and the rextester test accordingly.

Dejell Over a year ago

Thanks. Do you think that there is a better db to save this type of information without the need to look recursively? rather than postgresSQL

Thorsten Kettner Over a year ago

Another DB or another DBMS? The database doesn't seem bad, but you are not writing every record. Though the data is regarded to come in hourly, you omit records when they have the same value as the previous. Thus your table has gaps. This makes your query so complicated; as a first step you must generate the missing rows (recursive query). If you want it easier, you'll have to write the records, so there are no gaps. Or use a programming language rather than SQL for calculation. As to your database management system: PostgreSQL is very good; I see no reason why you would change it for another.

Gordon Linoff · Accepted Answer · 2017-06-29 02:42:50Z

1

You want the distinct on before the group by:

select id, sum(value)
from (select distinct on (id, run_hour) m.*
      from metrics m
      where run_hour <= '2017-06-25 09:00' and
            performance_hour >= '2017-06-25 07:00' and
            performance_hour < '2017-06-25 09:00'
      order by id, run_hour, performance_hour desc
     ) m
group by id;

edited Jun 29, 2017 at 2:42

answered Jun 28, 2017 at 14:17

Gordon Linoff

1.3m62 gold badges705 silver badges857 bronze badges

4 Comments

Dejell Over a year ago

I get missing FROM-clause entry for table "m"

Thorsten Kettner Over a year ago

Change m.* to metrics.*.

Dejell Over a year ago

it doesn't work. I get for id: 2 the sum of 10 instead of 11

Dejell Over a year ago

It takes the run of the 8:00 for id 2, however it should take the value of the run of 9:00 which returned 6

Collectives™ on Stack Overflow

"distinct on" with group by postgres

2 Answers 2

5 Comments

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related