1

I have the following records:

id  run_hour               performance_hour      value
2  "2017-06-25 09:00:00"  "2017-06-25 07:00:00"    6
2  "2017-06-25 09:00:00"  "2017-06-25 08:00:00"    5
1  "2017-06-25 09:00:00"  "2017-06-25 08:00:00"    5
2  "2017-06-25 08:00:00"  "2017-06-25 07:00:00"    5
1  "2017-06-25 08:00:00"  "2017-06-25 07:00:00"    5

We run every hour to look at the result of each id for current hour and previous hours.

Only if there was a change from previous hour run we insert a new reocrd (we don't want to overwrite the value because we want to measure the value if looked after 1 hour, or 2 hours etc.

I would like to sum for each id in the latest available value (sort by run_hour) - the values.

In the above example, ad 1 for run 9:00 and performance hour of 7:00 doesn't have a record - because it was the same as the run of 8:00 and performance hour of 7:00

In the above example, if I ask for sum of values of run 2017-06-25 09:00:00 I would expect to get:

id, value
1   10
2   11

for id 1, it's 10 calculated: (run_hour<2017-06-25 08:00:00> + run_hour<2017-06-25 09:00:00>) and for id 2, it's 11 calculated: (run_hour<2017-06-25 09:00:00> + run_hour<2017-06-25 09:00:00>) I wrote the following query:

select distinct on (id, run_hour) id, sum(value) from metrics where  run_hour <= '2017-06-25 09:00' and performance_hour >= '2017-06-25 07:00' and  performance_hour < '2017-06-25 09:00'
group by id
order by id, run_hour

However I get an error that also run_hour has to be in the GROUP BY clause. - but if I add it I get incorrect data - also data for previous hours which I don't need - I need the latest hour that had data.

How can I use "distinct on" with group by?

5
  • What makes you decide for the record with value 6 for id=2 , run_hour= "2017-06-25 09:00:00"? Because its earlier performance_hour? Commented Jun 28, 2017 at 14:28
  • It's data that I get every hour about performance of our ads. when I ran at 9:00 for hour 7 - I got value 6, which is different that the previous run for that hour (8:00 for hour 7 and ad 2) so I insert a new record Commented Jun 28, 2017 at 14:29
  • No, I don't understand. You have two records for id=2 , run_hour= "2017-06-25 09:00:00". You take the one with value 6 and dismiss the one with value 5 when building the sum. Why that record and not the other one? Commented Jun 28, 2017 at 14:31
  • the performance hour is different. for id at 9:00 for performance between 7-9 the sum is 11, however, for id 1, there is no record for run hour 9:00 and performance record 8:00, because I retrieved the same data from the external source so I didn't want to create a redundant record (but since the value is the same - I need to use the record from run 8:00 to performance 7:00 for id 1) Commented Jun 28, 2017 at 14:33
  • 1
    Okay, I think I got it now. Complicated. I'll have to think of a solution... Commented Jun 28, 2017 at 14:37

2 Answers 2

2

The task is very complicated. Let's say you want the performance hours 7:00 till 9:00 from the following data:

id  run_hour               performance_hour      value
2   "2017-06-25 09:00:00"  "2017-06-25 06:00:00"    6
2   "2017-06-25 09:00:00"  "2017-06-25 10:00:00"    5

The expected result would be 18 (6 for 7:00 + 6 for 8:00 + 6 for 9:00) all based on the 6:00 record which itself is outside the desired time range.

We need a recursive CTE starting from the first wanted performance hour per id till the last wanted one. Thus we build records that don't exist and that we can sum up later.

with recursive cte(id, run_hour, performance_hour, value) as
(
  select *
  from
  (
    select distinct on (id) 
      id, 
      run_hour,
      greatest(performance_hour, timestamp '2017-06-25 07:00') as performance_hour, 
      value
    from metrics
    where run_hour = timestamp '2017-06-25 09:00' 
      and performance_hour <= timestamp '2017-06-25 07:00'
    order by id, metrics.performance_hour desc
  ) start_by_id
  union all
  select 
    cte.id, 
    cte.run_hour,
    cte.performance_hour + interval '1 hour' as performance_hour,
    coalesce(m.value, cte.value) as value
  from cte
  left join metrics m on m.id = cte.id
                      and m.run_hour = cte.run_hour
                      and m.performance_hour = cte.performance_hour + interval '1 hour'
  where cte.performance_hour < timestamp '2017-06-25 09:00'
)
select id, sum(value)
from cte
group by id;

Rextester link: http://rextester.com/PHC88770

Sign up to request clarification or add additional context in comments.

5 Comments

why the expected result will be 18 and not 11 for id: 2? I need the most recent run_time record if exists
Okay, I missed the part that you want the time till current hour. Yes, this would include the second record, of course. The question however remains how many previous hours. If you look at 7:00 to 10:00 you'll get a sum of 23. If you look at 9:00 to 10:00 you get a sum of 11. If you look at 9:00 to 12:00 you get 21. Just adjust the timestamps in the query so as to get the range you want.
I had a bug in my query, though. In the DISTINCT ON query I ordered by id, performance_hour instead of id, metrics.performance_hour. That could pick the wrong record due to ambiguity. Well, that comes from using an alias name that is also a column name :-| I've updated my answer and the rextester test accordingly.
Thanks. Do you think that there is a better db to save this type of information without the need to look recursively? rather than postgresSQL
Another DB or another DBMS? The database doesn't seem bad, but you are not writing every record. Though the data is regarded to come in hourly, you omit records when they have the same value as the previous. Thus your table has gaps. This makes your query so complicated; as a first step you must generate the missing rows (recursive query). If you want it easier, you'll have to write the records, so there are no gaps. Or use a programming language rather than SQL for calculation. As to your database management system: PostgreSQL is very good; I see no reason why you would change it for another.
1

You want the distinct on before the group by:

select id, sum(value)
from (select distinct on (id, run_hour) m.*
      from metrics m
      where run_hour <= '2017-06-25 09:00' and
            performance_hour >= '2017-06-25 07:00' and
            performance_hour < '2017-06-25 09:00'
      order by id, run_hour, performance_hour desc
     ) m
group by id;

4 Comments

I get missing FROM-clause entry for table "m"
Change m.* to metrics.*.
it doesn't work. I get for id: 2 the sum of 10 instead of 11
It takes the run of the 8:00 for id 2, however it should take the value of the run of 9:00 which returned 6

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.