
I have a table containing the fields user_ip, datetime, year, month, day, hour, tag_id, country, device_type, brand. I need to check whether a given IP was active for a continuous period of 4 or more hours and flag it as true/false.

  • A single IP can have multiple records within the same hour with different timestamps. I need to identify IPs that were continuously active for at least 4 hours.
  • For hours 0, 1 and 2 we need to check the previous day's last few hours to account for at least 4 consecutive hours.
  • Minutes are to be ignored (thus 21:59 counts as 21:00, while 22:01 counts as 22:00: those two timestamps, 2 minutes apart, are considered to cover 2 consecutive hours, as would 21:01 and 22:59); see the truncation sketch after this list.

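For illustration, the minutes-ignored rule can be made concrete with DATE_TRUNC (a sketch added for clarity, not part of the original table):

SELECT ts, DATE_TRUNC('hour', ts) AS hour_bucket
FROM (VALUES (TIMESTAMP '2025-05-25 21:59:00'),
             (TIMESTAMP '2025-05-25 22:01:00')) AS v(ts);
-- hour buckets: 2025-05-25 21:00:00 and 2025-05-25 22:00:00, exactly 1 hour
-- apart, so the two rows count as 2 consecutive active hours
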
Query tried:

WITH sample_data (user_ip, datetime, year, month, day, hour, tag_id, country, device_type, brand) AS
(
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525000000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 0, 1001, 'US', 1, 'Samsung' union all
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525010000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 1, 1001, 'US', 1, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525020000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 2, 1001, 'US', 1, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525030000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 3, 1001, 'US', 1, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525040000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 4, 1001, 'US', 2, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525050000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 5, 1002, 'US', 2, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525060000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 6, 1002, 'US', 2, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525070000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 7, 1002, 'US', 2, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525080000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 8, 1002, 'US', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525090000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 9, 1003, 'US', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525100000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 10, 1003, 'US', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525110000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 11, 1003, 'US', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215100', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1004, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215200', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1005, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215300', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1006, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215400', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1007, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215500', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1008, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525220000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 22, 1004, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525230000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 23, 1004, 'CA', 4, 'LG' union all 
  
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526000000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 0, 1005, 'CA', 4, 'LG' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526010000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 1, 1005, 'CA', 4, 'LG' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526020000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 2, 1005, 'CA', 4, 'LG' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526030000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 3, 1006, 'CA', 5, 'Sony' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526040000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 4, 1006, 'CA', 5, 'Sony' union all 

  SELECT '192.168.0.101', TO_TIMESTAMP('20250526085100', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 8, 1006, 'CA', 5, 'Sony' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526085200', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 8, 1007, 'CA', 5, 'Sony' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526085300', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 8, 1008, 'CA', 5, 'Sony' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526085400', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 8, 1009, 'CA', 5, 'Sony' 
  )
,
grid AS 
(
    SELECT user_ip,datetime, year,month,day,hour, 
                Coalesce( Case When LAG(hour) Over(Partition By user_ip Order By datetime, hour) IN (hour, hour - 1)
                            OR Min(hour) Over(Partition By user_ip) = hour 
                            Then 'Y' 
                    End, Cast(hour as Varchar(2))
                    ) as consequtive_hour,
                hour - LAG(hour) Over(Partition By user_ip Order By datetime, hour) as diff_hours,
                Row_Number() Over(Partition By user_ip Order By datetime, hour
                                Rows Between Unbounded Preceding And Current Row) as rn,
        tag_id,country,device_type,brand
    from sample_data
)

Select user_ip,datetime, year,month,day,hour, 
       Case When Count(consequtive_hour) 
                       Over(Partition By user_ip, consequtive_hour
                            Order By datetime, hour
                            Rows Between Unbounded Preceding And Current Row) >= 4
                AND consequtive_hour = 'Y' And Sum(Case When consequtive_hour = 'Y' Then diff_hours Else 0 End) Over(Partition By user_ip, consequtive_hour
                            Order By datetime, hour
                            Rows Between Unbounded Preceding And Current Row) >= 3
            Then 'TRUE'
       Else 'FALSE'
       End as is_active_min_4hr, 
       consequtive_hour, diff_hours,rn,tag_id,country,device_type,brand
From grid 
Order By user_ip, datetime, hour, rn

fiddle: https://dbfiddle.uk/Q0vyCsJN

Expected result:

  • is_active_min_4hr: should be 'FALSE' for records 14-17 (hour = 21), given hour 21 is not continuous due to no activity in hours 18, 19 and 20
  • is_active_min_4hr: should be 'FALSE' for record 18 (hour = 22), due to no activity in hours 19 and 20
  • is_active_min_4hr: should be 'FALSE' for record 19 (hour = 23), due to no activity in hour 20
  • is_active_min_4hr: should be 'TRUE' for records 20-24 (hours 0-4 of the next day), due to continuous activity; for example, hour 00 of 2025-05-26 has continuous activity from the prior hours (21, 22, 23 of the previous day, 2025-05-25)
  • Do the columns tag_id, country, device_type, brand matter to the question (answer)? Commented May 29 at 13:49
  • Can you confirm that "after 4 hours" means 4 hours after truncating the minutes (thus 21:51:00 is counted as 21:00:00)? Because the intuitive meaning would be 4 "real" hours, which would prevent 00:00:00 from being flagged (because 4 hours after 21:51:00 would only have elapsed at 00:51:00, thus only 01:00:00 would be the first flagged timestamp). Note that @d-r's answer is right in that second meaning (see his results table: 00:00:00 is flagged false). Commented May 29 at 18:05
  • @ValNik yes, we need those attributes Commented May 30 at 5:21
  • @GuillaumeOutters just the hour part, without any minutes; not counting from the minutes onwards. Commented May 30 at 5:22
  • OK, I adapted my answer by replacing every datetime by a date_trunc('hour', datetime) (and this would work equally well for @d-r's answer). Note that to properly test for this subtlety, your test should stress this rule by transforming 2025-05-25 03:00 to 2025-05-25 03:33: for now your only test having minutes is 21:51, which only has to compare to its successor 22:00 (always >= 22:00 - 1:00, be it complete with minutes or truncated to 21:00), while 3:33, having both a predecessor 2:00 and a successor 4:00, would test both a less-than-1 h interval and a more-than-1 h one (see the sketch below). Commented May 30 at 21:02
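
For illustration, the stress test suggested in the last comment would amount to replacing the hour-3 row of the sample data with something like this (a hypothetical variant, not in the original fiddle):

  SELECT '192.168.0.101', TO_TIMESTAMP('20250525033300', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 3, 1001, 'US', 1, 'Samsung'
  -- 02:00 -> 03:33 is 1 h 33 min raw but exactly 1 h truncated, while
  -- 03:33 -> 04:00 is 27 min raw but exactly 1 h truncated, so it
  -- exercises both a more-than-1 h and a less-than-1 h raw interval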

2 Answers


Try it using datetime and intervals like below:
1. CTE grid

  • creates a unique column rn ( Row_Number() Over() )
  • calculates the interval between the current and the previous connection (hour_diff column)
WITH 
grid AS 
  ( SELECT     user_ip, datetime, year, month, day, hour, 
               tag_id, country, device_type, brand, 
               Row_Number() Over(Partition By user_ip Order By datetime) as rn,
               datetime - LAG(datetime) Over(Partition By user_ip Order By datetime) as hour_diff
    FROM       sample_data
  ), 

2. CTE grps

  • defines the start of each consecutive group of rows by putting the unique rn into the group's starting row ( grp_start_rn column )
  • resets hour_diff on each group's starting row ( grp_hour_diff column )
grps AS
  ( Select   g.*, 
             Case When Extract( Hour From g.datetime - LAG(g.datetime) Over(Partition By user_ip Order By datetime) ) > 1
                      OR g.hour_diff Is Null
                  Then rn
             End grp_start_rn ,
        CASE WHEN 
                  Case When Extract( Hour From g.datetime - LAG(g.datetime) Over(Partition By user_ip Order By datetime) ) > 1
                             OR g.hour_diff Is Null
                       Then rn
                  End Is Not Null
           THEN '1 Hour'::Interval
       ELSE g.hour_diff
       END as grp_hour_diff
   From         grid g
   Order By     g.rn
 )

3. Main SQL

  • summarizes intervals per group (total_grp_hours column) and, if the sum is 4 hours or more, decides the flag column
  • a subquery fills in the missing group-defining rn from the group's starting row
SELECT    x.user_ip, x.datetime, x.year, x.month, x.day, x.hour, 
          Sum(Coalesce(x.grp_hour_diff, '1 Hour'::Interval)) 
                      Over(Partition By x.user_ip, x.grp Order By x.datetime
                      Rows Between Unbounded Preceding And Current Row) as total_grp_hours, 
    CASE WHEN
            Sum(Coalesce(x.grp_hour_diff, '1 Hour'::Interval)) 
                      Over(Partition By x.user_ip, x.grp Order By x.datetime
                      Rows Between Unbounded Preceding And Current Row) >= '4 Hour'::Interval
         THEN 'TRUE'
   ELSE 'False'
   END as flag, 
   x.grp, x.rn
FROM    ( Select      g.*, Max(g.grp_start_rn) Over(Partition By user_ip Order By datetime
                              Rows Between Unbounded Preceding And Current Row) as grp
          From        grps g
       ) x


user_ip        datetime                year  month  day  hour  total_grp_hours  flag   grp  rn
192.168.0.101  2025-05-25 00:00:00+00  2025  5      25   0     01:00:00         False  1    1
192.168.0.101  2025-05-25 01:00:00+00  2025  5      25   1     02:00:00         False  1    2
192.168.0.101  2025-05-25 02:00:00+00  2025  5      25   2     03:00:00         False  1    3
192.168.0.101  2025-05-25 03:00:00+00  2025  5      25   3     04:00:00         TRUE   1    4
192.168.0.101  2025-05-25 04:00:00+00  2025  5      25   4     05:00:00         TRUE   1    5
192.168.0.101  2025-05-25 05:00:00+00  2025  5      25   5     06:00:00         TRUE   1    6
192.168.0.101  2025-05-25 06:00:00+00  2025  5      25   6     07:00:00         TRUE   1    7
192.168.0.101  2025-05-25 07:00:00+00  2025  5      25   7     08:00:00         TRUE   1    8
192.168.0.101  2025-05-25 08:00:00+00  2025  5      25   8     09:00:00         TRUE   1    9
192.168.0.101  2025-05-25 09:00:00+00  2025  5      25   9     10:00:00         TRUE   1    10
192.168.0.101  2025-05-25 10:00:00+00  2025  5      25   10    11:00:00         TRUE   1    11
192.168.0.101  2025-05-25 11:00:00+00  2025  5      25   11    12:00:00         TRUE   1    12
192.168.0.101  2025-05-25 21:51:00+00  2025  5      25   21    01:00:00         False  13   13
192.168.0.101  2025-05-25 21:52:00+00  2025  5      25   21    01:01:00         False  13   14
192.168.0.101  2025-05-25 21:53:00+00  2025  5      25   21    01:02:00         False  13   15
192.168.0.101  2025-05-25 21:54:00+00  2025  5      25   21    01:03:00         False  13   16
192.168.0.101  2025-05-25 21:55:00+00  2025  5      25   21    01:04:00         False  13   17
192.168.0.101  2025-05-25 22:00:00+00  2025  5      25   22    01:09:00         False  13   18
192.168.0.101  2025-05-25 23:00:00+00  2025  5      25   23    02:09:00         False  13   19
192.168.0.101  2025-05-26 00:00:00+00  2025  5      26   0     03:09:00         False  13   20
192.168.0.101  2025-05-26 01:00:00+00  2025  5      26   1     04:09:00         TRUE   13   21
192.168.0.101  2025-05-26 02:00:00+00  2025  5      26   2     05:09:00         TRUE   13   22
192.168.0.101  2025-05-26 03:00:00+00  2025  5      26   3     06:09:00         TRUE   13   23
192.168.0.101  2025-05-26 04:00:00+00  2025  5      26   4     07:09:00         TRUE   13   24
192.168.0.101  2025-05-26 08:51:00+00  2025  5      26   8     01:00:00         False  25   25
192.168.0.101  2025-05-26 08:52:00+00  2025  5      26   8     01:01:00         False  25   26
192.168.0.101  2025-05-26 08:53:00+00  2025  5      26   8     01:02:00         False  25   27
192.168.0.101  2025-05-26 08:54:00+00  2025  5      26   8     01:03:00         False  25   28

fiddle


7 Comments

I note that your answer is right within the usual meaning of "after 4 hours": you don't flag 00:00:00 because the first timestamp occurred at 21:51:00 (thus only 3 hours and 9 minutes before, not 4 hours), contrary to what the OP expected due to their imprecise specification.
@GuillaumeOutters Exactly. This is just a draft of a possible solution; the OP should adjust it to the actual context and logic. That shouldn't be too complicated: change the Case expressions and get the other result. The structure is all here.
@user16798185 There are no correlated subqueries here, nor any kind of joins at all - just two-step data preprocessing. Also, in step 1, if you have a unique row identifier, you don't need the rn column at all - just use the row id instead. Performance will depend on the size of the table, but we don't know your database structure, description, indexes, relations nor dependencies to make any estimations of the kind.
@user16798185 Didn't test it, but it looks like an option that could be as good as mine. Again, without an insight into your context, it is impossible to tell for sure. I just think mine is more straightforward and more readable - not better - you are the one to test the correctness and the performance.
I consider @d-r's solution more general, that is, it has a functional advantage (it can work with full timestamps, or truncated ones by replacing all datetime with date_trunc('hour', datetime)), while mine can only work with truncated timestamps (it's nearly a hack… which makes its beauty, and the resulting simplicity). From a performance point of view, as long as both have to date_trunc, I think they will not be optimal (but if you could post explain analyze for both it would be interesting). But if you give up date_trunc, a windowed approach like @d-r's may benefit from indexes.

If, as I understand from your description, you don't mind minutes making interstices of more than 1 hour (but less than 2 hours: for example in 13:00 - 14:15 - 15:00 we have the 13 - 14 - 15 series, but instead of two interstices of 1 h we've got one of 1 hour 15 minutes and one of 45 minutes), as long as the extremities are less than 4 hours apart,
a quick method is ensuring that we have activity in each of the 4 distinct hour buckets ending at the activity to flag (the current hour plus the 3 preceding ones),
that is, that

COUNT(DISTINCT DATE_TRUNC('hour', datetime))
OVER (PARTITION BY user_ip ORDER BY DATE_TRUNC('hour', datetime) RANGE BETWEEN INTERVAL '3 hours' PRECEDING AND CURRENT ROW)
= 4 -- the current row's hour bucket plus the 3 preceding ones

Sadly PostgreSQL doesn't allow COUNT(DISTINCT) OVER, but we can easily emulate it with a correlated sub-SELECT:

SELECT
    *,
    (
        SELECT COUNT(DISTINCT DATE_TRUNC('hour', datetime))
        FROM sample_data prev
        WHERE prev.user_ip = sample_data.user_ip
        AND DATE_TRUNC('hour', prev.datetime) BETWEEN DATE_TRUNC('hour', sample_data.datetime) - INTERVAL '3 hours' AND DATE_TRUNC('hour', sample_data.datetime)
    ) = 4 AS flagged
FROM sample_data;

I've put it in a fiddle for demo purposes.

3 Comments

Great, works as expected. Initially I considered count(distinct) as an analytic function; as that was not supported, I was looking for other ways. Just curious, what are the performance implications of this approach for a larger dataset, especially the table scan (in the flagged correlated subquery over sample_data prev) within the select itself?
Is there any performance bottleneck with using a correlated sub-SELECT on a large data volume?
I really couldn't tell. By intuition I would say that, from a performance point of view, my solution is quite catastrophic, due to each row having its own subselect; moreover, each subselect doesn't work with indexed columns, but with transformed ones (DATE_TRUNC). I targeted extreme conciseness and readability, but a first performance-oriented step could be to have a CTE that computes DATE_TRUNC once for all rows (and then use it as a source both for sample_data and prev), as sketched below.
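
A minimal sketch of that CTE-based variant (assuming the same sample_data shape; the CTE name hours is illustrative): the hour bucket is computed once per row, then reused on both sides of the correlated subquery:

WITH hours AS (
    -- compute each row's hour bucket once
    SELECT user_ip,
           datetime,
           DATE_TRUNC('hour', datetime) AS hour_bucket
    FROM sample_data
)
SELECT h.*,
       (   -- same correlated count as in the answer, on precomputed buckets
           SELECT COUNT(DISTINCT prev.hour_bucket)
           FROM hours prev
           WHERE prev.user_ip = h.user_ip
             AND prev.hour_bucket BETWEEN h.hour_bucket - INTERVAL '3 hours'
                                      AND h.hour_bucket
       ) = 4 AS flagged
FROM hours h;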
