1

I have a table looks like this

CREATE TABLE myTable
(
    userid      text,
    webid         text,
    "timestamp" timestamp
);

INSERT INTO myTable
    ("userid", "webid", "timestamp")
VALUES ('A', '34', '2023-01-31 17:34:49.000'),
       ('A', '73', '2023-01-31 17:34:50.000'),
       ('A', '97', '2023-01-31 17:34:58.000'),
       ('A', '17', '2023-01-31 17:35:02.000'),
       ('A', '17', '2023-01-31 17:35:07.000'),
       ('A', '17', '2023-01-31 17:35:18.000'),
       ('A', '17', '2023-01-31 17:35:30.000'),
       ('A', '1', '2023-01-31 17:35:37.000'),
       ('A', '1', '2023-01-31 17:35:38.000'),
       ('A', '77', '2023-01-31 17:35:41.000'),
       ('A', '77', '2023-01-31 17:35:42.000'),
       ('A', '1', '2023-01-31 17:37:10.000'),
       ('A', '1', '2023-01-31 17:37:12.000'),
       ('A', '77', '2023-01-31 17:37:14.000'),
       ('A', '77', '2023-01-31 17:37:15.000'),
       ('B', '33', '2023-01-31 06:37:15.000'),
       ('B', '56', '2023-01-31 06:40:15.000')
  ;

userid webid timestamp
A 34 2023-01-31 17:34:49
A 73 2023-01-31 17:34:50
A 97 2023-01-31 17:34:58
A 17 2023-01-31 17:35:02
A 17 2023-01-31 17:35:07
A 17 2023-01-31 17:35:18
A 17 2023-01-31 17:35:30
A 1 2023-01-31 17:35:37
A 1 2023-01-31 17:35:38
A 77 2023-01-31 17:35:41
A 77 2023-01-31 17:35:42
A 1 2023-01-31 17:37:10
A 1 2023-01-31 17:37:12
A 77 2023-01-31 17:37:14
A 77 2023-01-31 17:37:15
B 33 2023-01-31 06:37:15
B 56 2023-01-31 06:40:15

The correct output I want is this:

userid webid timestamp first_ts
A 34 2023-01-31 17:34:49 2023-01-31 17:34:49
A 73 2023-01-31 17:34:50 2023-01-31 17:34:50
A 97 2023-01-31 17:34:58 2023-01-31 17:34:58
A 17 2023-01-31 17:35:02 2023-01-31 17:35:02
A 17 2023-01-31 17:35:07 2023-01-31 17:35:02
A 17 2023-01-31 17:35:18 2023-01-31 17:35:02
A 17 2023-01-31 17:35:30 2023-01-31 17:35:02
A 1 2023-01-31 17:35:37 2023-01-31 17:35:37
A 1 2023-01-31 17:35:38 2023-01-31 17:35:37
A 77 2023-01-31 17:35:41 2023-01-31 17:35:41
A 77 2023-01-31 17:35:42 2023-01-31 17:35:41
A 1 2023-01-31 17:37:10 2023-01-31 17:37:10
A 1 2023-01-31 17:37:12 2023-01-31 17:37:10
A 77 2023-01-31 17:37:14 2023-01-31 17:37:14
A 77 2023-01-31 17:37:15 2023-01-31 17:37:14
B 33 2023-01-31 06:37:15 2023-01-31 06:37:15
B 56 2023-01-31 06:40:15 2023-01-31 06:40:15

first_ts means first timestamp. If a userid visits consecutive webid, their first_ts should be the first timestamp of this webid. If the userid visits webid let's say webid = 77 and jumps to another webid = 1 and then jumps back to webid = 77, the first_ts should be refreshed, it won't be the very first one. I used following code to do that, but unable to get correct first_ts for userid who jumps back to previous webid.

select userid, webid, 
lag(webid,1) over(partition by userid order by timestamp desc) as next_webid,
lag(webid,1) over(partition by userid order by timestamp asc) as previous_webid,
timestamp,
case when webid = next_webid then first_value(timestamp) over(partition by userid, webid order by timestamp) 
     when webid = next_webid and webid = previous_webid then first_value(timestamp) over(partition by userid, webid order by timestamp) 
     when webid = previous_webid then first_value(timestamp) over(partition by userid, webid order by timestamp) 
     else timestamp
end as first_ts
from myTable 
order by userid, timestamp;

this wrong query will return output

userid webid timestamp first_ts
A 34 2023-01-31 17:34:49 2023-01-31 17:34:49
A 73 2023-01-31 17:34:50 2023-01-31 17:34:50
A 97 2023-01-31 17:34:58 2023-01-31 17:34:58
A 17 2023-01-31 17:35:02 2023-01-31 17:35:02
A 17 2023-01-31 17:35:07 2023-01-31 17:35:02
A 17 2023-01-31 17:35:18 2023-01-31 17:35:02
A 17 2023-01-31 17:35:30 2023-01-31 17:35:02
A 1 2023-01-31 17:35:37 2023-01-31 17:35:37
A 1 2023-01-31 17:35:38 2023-01-31 17:35:37
A 77 2023-01-31 17:35:41 2023-01-31 17:35:41
A 77 2023-01-31 17:35:42 2023-01-31 17:35:41
A 1 2023-01-31 17:37:10 2023-01-31 17:35:37
A 1 2023-01-31 17:37:12 2023-01-31 17:35:37
A 77 2023-01-31 17:37:14 2023-01-31 17:35:41
A 77 2023-01-31 17:37:15 2023-01-31 17:35:41
B 33 2023-01-31 06:37:15 2023-01-31 06:37:15
B 56 2023-01-31 06:40:15 2023-01-31 06:40:15

in row 12, 13, 14, 15, the first_ts is wrong. Anyone knows how to modify the query to get correct first_ts? Thanks!

1 Answer 1

1

This is a gaps and islands problem. One solution uses the difference in row numbers method:

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY timestamp) rn1,
              ROW_NUMBER() OVER (PARTITION BY userid, webid ORDER BY timestamp) rn2
    FROM myTable
)

SELECT userid, webid, timestamp,
       MIN(timestamp) OVER (PARTITION BY userid, webid, rn1 - rn2) AS first_ts
FROM cte
ORDER BY userid, timestamp;

The idea here is to use the row number difference to form a pseudo group, which is defined by the userid, webid, and rn1 - rn2. We then take the minimum timestamp for each such group/island to find the first_ts. To get a better understanding of how difference in row numbers works, just run SELECT * FROM cte and check the results.

Sign up to request clarification or add additional context in comments.

3 Comments

hi, I am trying to cooperate 30 minutes rule on first_st, there will be one corner case like the example here won't pass. The example is here: dbfiddle.uk/D4xqViDn I use your query to cooperate 30 minutes rule, the row 17 will have wrong first_ts. do you know how to modify the code?
30 minute rule totally changes the question and invalidates my answer. I suggest asking a new question.
stackoverflow.com/questions/75692009/… I post right here. if you can help, very appreciate it.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.