
I have a table containing the fields user_ip, datetime, year, month, day, hour, tag_id, country, device_type, brand. I need to check whether a given IP was active for a continuous period of 4 or more hours and flag it as true/false.

  • A single IP can have multiple records within the same hour with different timestamps. I need to identify IPs that were continuously active for at least 4 hours.
  • For hours 0, 1 and 2 we need to check the previous day's last few hours to account for at least 4 consecutive hours.
  • Minutes are to be ignored (thus 21:59 counts as 21:00, while 22:01 counts as 22:00: those two timestamps, 2 minutes apart, are considered to cover 2 consecutive hours, as would 21:01 and 22:59); see the truncation sketch after this list.

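For illustration, the minutes-ignored rule can be made concrete with DATE_TRUNC (a sketch added for clarity, not part of the original table):

SELECT ts, DATE_TRUNC('hour', ts) AS hour_bucket
FROM (VALUES (TIMESTAMP '2025-05-25 21:59:00'),
             (TIMESTAMP '2025-05-25 22:01:00')) AS v(ts);
-- hour buckets: 2025-05-25 21:00:00 and 2025-05-25 22:00:00, exactly 1 hour
-- apart, so the two rows count as 2 consecutive active hours
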
Query tried:

WITH sample_data (user_ip, datetime, year, month, day, hour, tag_id, country, device_type, brand) AS
(
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525000000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 0, 1001, 'US', 1, 'Samsung' union all
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525010000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 1, 1001, 'US', 1, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525020000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 2, 1001, 'US', 1, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525030000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 3, 1001, 'US', 1, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525040000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 4, 1001, 'US', 2, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525050000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 5, 1002, 'US', 2, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525060000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 6, 1002, 'US', 2, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525070000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 7, 1002, 'US', 2, 'Samsung' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525080000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 8, 1002, 'US', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525090000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 9, 1003, 'US', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525100000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 10, 1003, 'US', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525110000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 11, 1003, 'US', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215100', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1004, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215200', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1005, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215300', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1006, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215400', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1007, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525215500', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 21, 1008, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525220000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 22, 1004, 'CA', 3, 'Apple' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250525230000', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 23, 1004, 'CA', 4, 'LG' union all 
  
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526000000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 0, 1005, 'CA', 4, 'LG' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526010000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 1, 1005, 'CA', 4, 'LG' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526020000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 2, 1005, 'CA', 4, 'LG' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526030000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 3, 1006, 'CA', 5, 'Sony' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526040000', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 4, 1006, 'CA', 5, 'Sony' union all 

  SELECT '192.168.0.101', TO_TIMESTAMP('20250526085100', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 8, 1006, 'CA', 5, 'Sony' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526085200', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 8, 1007, 'CA', 5, 'Sony' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526085300', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 8, 1008, 'CA', 5, 'Sony' union all 
  SELECT '192.168.0.101', TO_TIMESTAMP('20250526085400', 'YYYYMMDDHH24MISS'), 2025, 5, 26, 8, 1009, 'CA', 5, 'Sony' 
  )
,
grid AS 
(
    SELECT user_ip,datetime, year,month,day,hour, 
                Coalesce( Case When LAG(hour) Over(Partition By user_ip Order By datetime, hour) IN (hour, hour - 1)
                            OR Min(hour) Over(Partition By user_ip) = hour 
                            Then 'Y' 
                    End, Cast(hour as Varchar(2))
                    ) as consequtive_hour,
                hour - LAG(hour) Over(Partition By user_ip Order By datetime, hour) as diff_hours,
                Row_Number() Over(Partition By user_ip Order By datetime, hour
                                Rows Between Unbounded Preceding And Current Row) as rn,
        tag_id,country,device_type,brand
    from sample_data
)

Select user_ip,datetime, year,month,day,hour, 
       Case When Count(consequtive_hour) 
                       Over(Partition By user_ip, consequtive_hour
                            Order By datetime, hour
                            Rows Between Unbounded Preceding And Current Row) >= 4
                AND consequtive_hour = 'Y' And Sum(Case When consequtive_hour = 'Y' Then diff_hours Else 0 End) Over(Partition By user_ip, consequtive_hour
                            Order By datetime, hour
                            Rows Between Unbounded Preceding And Current Row) >= 3
            Then 'TRUE'
       Else 'FALSE'
       End as is_active_min_4hr, 
       consequtive_hour, diff_hours,rn,tag_id,country,device_type,brand
From grid 
Order By user_ip, datetime, hour, rn

fiddle: https://dbfiddle.uk/Q0vyCsJN

Expected result:

  • is_active_min_4hr: should be 'FALSE' for records 14-17 (hour = 21), given hour 21 is not continuous due to no activity in hours 18, 19 and 20
  • is_active_min_4hr: should be 'FALSE' for record 18 (hour = 22), due to no activity in hours 19 and 20
  • is_active_min_4hr: should be 'FALSE' for record 19 (hour = 23), due to no activity in hour 20
  • is_active_min_4hr: should be 'TRUE' for records 20-24 (hours 0-4 of the next day), due to continuous activity; for example, hour 00 of 2025-05-26 has continuous activity from the prior hours (21, 22, 23 of the previous day, 2025-05-25)
  • Do the columns tag_id, country, device_type, brand matter to the question (answer)? Commented May 29 at 13:49
  • Can you confirm that "after 4 hours" means 4 hours after truncating the minutes (thus 21:51:00 is counted as 21:00:00)? Because the intuitive meaning would be 4 "real" hours, which would prevent 00:00:00 from being flagged (because 4 hours after 21:51:00 would only have elapsed at 00:51:00, thus only 01:00:00 would be the first flagged timestamp). Note that @d-r's answer is right in that second meaning (see his results table: 00:00:00 is flagged false). Commented May 29 at 18:05
  • @ValNik yes, we need those attributes Commented May 30 at 5:21
  • @GuillaumeOutters just the hour part, without any minutes; not counting from the minutes onwards. Commented May 30 at 5:22
  • OK, I adapted my answer by replacing every datetime by a date_trunc('hour', datetime) (and this would work equally well for @d-r's answer). Note that to properly test for this subtlety, your test should stress this rule by transforming 2025-05-25 03:00 to 2025-05-25 03:33: for now your only test having minutes is 21:51, which only has to compare to its successor 22:00 (always >= 22:00 - 1:00, be it complete with minutes or truncated to 21:00), while 3:33, having both a predecessor 2:00 and a successor 4:00, would test both a less-than-1 h interval and a more-than-1 h one (see the sketch below). Commented May 30 at 21:02
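
For illustration, the stress test suggested in the last comment would amount to replacing the hour-3 row of the sample data with something like this (a hypothetical variant, not in the original fiddle):

  SELECT '192.168.0.101', TO_TIMESTAMP('20250525033300', 'YYYYMMDDHH24MISS'), 2025, 5, 25, 3, 1001, 'US', 1, 'Samsung'
  -- 02:00 -> 03:33 is 1 h 33 min raw but exactly 1 h truncated, while
  -- 03:33 -> 04:00 is 27 min raw but exactly 1 h truncated, so it
  -- exercises both a more-than-1 h and a less-than-1 h raw interval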

2 Answers


Try it using datetime and intervals like below:
1. CTE grid

  • creates a unique column rn ( Row_Number() Over() )
  • calculates the interval between the current and the previous connection (hour_diff column)
WITH 
grid AS 
  ( SELECT     user_ip, datetime, year, month, day, hour, 
               tag_id, country, device_type, brand, 
               Row_Number() Over(Partition By user_ip Order By datetime) as rn,
               datetime - LAG(datetime) Over(Partition By user_ip Order By datetime) as hour_diff
    FROM       sample_data
  ), 

2. CTE grps

  • defines the start of each consecutive group of rows by putting the unique rn into the group's starting row ( grp_start_rn column )
  • resets hour_diff on each group's starting row ( grp_hour_diff column )
grps AS
  ( Select   g.*, 
             Case When Extract( Hour From g.datetime - LAG(g.datetime) Over(Partition By user_ip Order By datetime) ) > 1
                      OR g.hour_diff Is Null
                  Then rn
             End grp_start_rn ,
        CASE WHEN 
                  Case When Extract( Hour From g.datetime - LAG(g.datetime) Over(Partition By user_ip Order By datetime) ) > 1
                             OR g.hour_diff Is Null
                       Then rn
                  End Is Not Null
           THEN '1 Hour'::Interval
       ELSE g.hour_diff
       END as grp_hour_diff
   From         grid g
   Order By     g.rn
 )

3. Main SQL

  • summarizes intervals per group (total_grp_hours column) and, if the sum is 4 hours or more, decides the flag column
  • a subquery fills in the missing group-defining rn from the group's starting row
SELECT    x.user_ip, x.datetime, x.year, x.month, x.day, x.hour, 
          Sum(Coalesce(x.grp_hour_diff, '1 Hour'::Interval)) 
                      Over(Partition By x.user_ip, x.grp Order By x.datetime
                      Rows Between Unbounded Preceding And Current Row) as total_grp_hours, 
    CASE WHEN
            Sum(Coalesce(x.grp_hour_diff, '1 Hour'::Interval)) 
                      Over(Partition By x.user_ip, x.grp Order By x.datetime
                      Rows Between Unbounded Preceding And Current Row) >= '4 Hour'::Interval
         THEN 'TRUE'
   ELSE 'False'
   END as flag, 
   x.grp, x.rn
FROM    ( Select      g.*, Max(g.grp_start_rn) Over(Partition By user_ip Order By datetime
                              Rows Between Unbounded Preceding And Current Row) as grp
          From        grps g
       ) x


user_ip        datetime                year  month  day  hour  total_grp_hours  flag   grp  rn
192.168.0.101  2025-05-25 00:00:00+00  2025  5      25   0     01:00:00         False  1    1
192.168.0.101  2025-05-25 01:00:00+00  2025  5      25   1     02:00:00         False  1    2
192.168.0.101  2025-05-25 02:00:00+00  2025  5      25   2     03:00:00         False  1    3
192.168.0.101  2025-05-25 03:00:00+00  2025  5      25   3     04:00:00         TRUE   1    4
192.168.0.101  2025-05-25 04:00:00+00  2025  5      25   4     05:00:00         TRUE   1    5
192.168.0.101  2025-05-25 05:00:00+00  2025  5      25   5     06:00:00         TRUE   1    6
192.168.0.101  2025-05-25 06:00:00+00  2025  5      25   6     07:00:00         TRUE   1    7
192.168.0.101  2025-05-25 07:00:00+00  2025  5      25   7     08:00:00         TRUE   1    8
192.168.0.101  2025-05-25 08:00:00+00  2025  5      25   8     09:00:00         TRUE   1    9
192.168.0.101  2025-05-25 09:00:00+00  2025  5      25   9     10:00:00         TRUE   1    10
192.168.0.101  2025-05-25 10:00:00+00  2025  5      25   10    11:00:00         TRUE   1    11
192.168.0.101  2025-05-25 11:00:00+00  2025  5      25   11    12:00:00         TRUE   1    12
192.168.0.101  2025-05-25 21:51:00+00  2025  5      25   21    01:00:00         False  13   13
192.168.0.101  2025-05-25 21:52:00+00  2025  5      25   21    01:01:00         False  13   14
192.168.0.101  2025-05-25 21:53:00+00  2025  5      25   21    01:02:00         False  13   15
192.168.0.101  2025-05-25 21:54:00+00  2025  5      25   21    01:03:00         False  13   16
192.168.0.101  2025-05-25 21:55:00+00  2025  5      25   21    01:04:00         False  13   17
192.168.0.101  2025-05-25 22:00:00+00  2025  5      25   22    01:09:00         False  13   18
192.168.0.101  2025-05-25 23:00:00+00  2025  5      25   23    02:09:00         False  13   19
192.168.0.101  2025-05-26 00:00:00+00  2025  5      26   0     03:09:00         False  13   20
192.168.0.101  2025-05-26 01:00:00+00  2025  5      26   1     04:09:00         TRUE   13   21
192.168.0.101  2025-05-26 02:00:00+00  2025  5      26   2     05:09:00         TRUE   13   22
192.168.0.101  2025-05-26 03:00:00+00  2025  5      26   3     06:09:00         TRUE   13   23
192.168.0.101  2025-05-26 04:00:00+00  2025  5      26   4     07:09:00         TRUE   13   24
192.168.0.101  2025-05-26 08:51:00+00  2025  5      26   8     01:00:00         False  25   25
192.168.0.101  2025-05-26 08:52:00+00  2025  5      26   8     01:01:00         False  25   26
192.168.0.101  2025-05-26 08:53:00+00  2025  5      26   8     01:02:00         False  25   27
192.168.0.101  2025-05-26 08:54:00+00  2025  5      26   8     01:03:00         False  25   28

fiddle


7 Comments

I note that your answer is right within the usual meaning of "after 4 hours": you don't flag 00:00:00 because the first timestamp occurred at 21:51:00 (thus only 3 hours and 9 minutes before, not 4 hours), contrary to what the OP expected due to their imprecise specification.
@GuillaumeOutters Exactly. This is just a draft of a possible solution; the OP should adjust it to the actual context and logic. That shouldn't be too complicated: change the Case expressions and get the other result. The structure is all here.
@user16798185 There are no correlated subqueries here, nor any kind of joins at all - just two-step data preprocessing. Also, in step 1, if you have a unique row identifier, you don't need the rn column at all - just use the row id instead. Performance will depend on the size of the table, but we don't know your database structure, description, indexes, relations nor dependencies to make any estimations of the kind.
@user16798185 Didn't test it, but it looks like an option that could be as good as mine. Again, without an insight into your context, it is impossible to tell for sure. I just think mine is more straightforward and more readable - not better - you are the one to test the correctness and the performance.
I consider @d-r's solution more general, that is, it has a functional advantage (it can work with full timestamps, or truncated ones by replacing all datetime with date_trunc('hour', datetime)), while mine can only work with truncated timestamps (it's nearly a hack… which makes its beauty, and the resulting simplicity). From a performance point of view, as long as both have to date_trunc, I think they will not be optimal (but if you could post explain analyze for both it would be interesting). But if you give up date_trunc, a windowed approach like @d-r's may benefit from indexes.

If, as I understand from your description, you don't mind minutes making interstices of more than 1 hour (but less than 2 hours: for example in 13:00 - 14:15 - 15:00 we have the 13 - 14 - 15 series, but instead of two interstices of 1 h we've got one of 1 hour 15 minutes and one of 45 minutes), as long as the extremities are less than 4 hours apart,
a quick method is ensuring that we have activity in each of the 4 distinct hour buckets ending at the activity to flag (the current hour plus the 3 preceding ones),
that is, that

COUNT(DISTINCT DATE_TRUNC('hour', datetime))
OVER (PARTITION BY user_ip ORDER BY DATE_TRUNC('hour', datetime) RANGE BETWEEN INTERVAL '3 hours' PRECEDING AND CURRENT ROW)
= 4 -- the current row's hour bucket plus the 3 preceding ones

Sadly PostgreSQL doesn't allow COUNT(DISTINCT) OVER, but we can easily emulate it with a correlated sub-SELECT:

SELECT
    *,
    (
        SELECT COUNT(DISTINCT DATE_TRUNC('hour', datetime))
        FROM sample_data prev
        WHERE prev.user_ip = sample_data.user_ip
        AND DATE_TRUNC('hour', prev.datetime) BETWEEN DATE_TRUNC('hour', sample_data.datetime) - INTERVAL '3 hours' AND DATE_TRUNC('hour', sample_data.datetime)
    ) = 4 AS flagged
FROM sample_data;

I've put it in a fiddle for demo purposes.

3 Comments

Great, works as expected. Initially I considered count(distinct) as an analytic function; as that was not supported, I was looking for other ways. Just curious, what are the performance implications of this approach for a larger dataset, especially the table scan (in the flagged correlated subquery over sample_data prev) within the select itself?
Is there any performance bottleneck with using a correlated sub-SELECT on a large data volume?
I really couldn't tell. By intuition I would say that, from a performance point of view, my solution is quite catastrophic, due to each row having its own subselect; moreover, each subselect doesn't work with indexed columns, but with transformed ones (DATE_TRUNC). I targeted extreme conciseness and readability, but a first performance-oriented step could be to have a CTE that computes DATE_TRUNC once for all rows (and then use it as a source both for sample_data and prev), as sketched below.
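
A minimal sketch of that CTE-based variant (assuming the same sample_data shape; the CTE name hours is illustrative): the hour bucket is computed once per row, then reused on both sides of the correlated subquery:

WITH hours AS (
    -- compute each row's hour bucket once
    SELECT user_ip,
           datetime,
           DATE_TRUNC('hour', datetime) AS hour_bucket
    FROM sample_data
)
SELECT h.*,
       (   -- same correlated count as in the answer, on precomputed buckets
           SELECT COUNT(DISTINCT prev.hour_bucket)
           FROM hours prev
           WHERE prev.user_ip = h.user_ip
             AND prev.hour_bucket BETWEEN h.hour_bucket - INTERVAL '3 hours'
                                      AND h.hour_bucket
       ) = 4 AS flagged
FROM hours h;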
