In a table, I have plan details of customers with their customer_id and enroll_date.
Now, I want to identify duplicate and valid enrollments from the overall data.
Duplicate: If a customer enrolls a plan, and then enrolls for another plan within 30 (<=30) days, then the second plan is considered as duplicate. Others are valid.
Example:
| customer_id | enroll_date | type |
|---|---|---|
| 2 | 2024-02-05 | valid |
| 2 | 2024-02-25 | duplicate |
| 1 | 2024-03-05 | valid |
| 1 | 2024-05-21 | valid |
| 1 | 2024-06-09 | duplicate |
| 1 | 2024-06-21 | valid |
| 1 | 2024-07-04 | duplicate |
| 1 | 2024-07-16 | duplicate |
| 1 | 2024-07-26 | valid |
| 1 | 2024-08-03 | duplicate |
| 1 | 2024-08-17 | duplicate |
| 1 | 2024-09-05 | valid |
Explanation:
Consider customer with customer_id = 1
The plan enrolled on 2024-03-05 is valid since its the first plan made by that customer.
Then, the next plan that is enrolled by the same customer is on 2024-05-21. Since its after 30 days since plan on 2024-03-05, it is valid.
Then, the next plan is on 2024-06-09 which is less than 30 days after the valid plan, which was on 2024-05-21. so it is a duplicate.
Then, the next plan is on 2024-06-21 which is more than 30 days since last valid plan, which was on 2024-05-21. so its valid.
Then, the next plan is on 2024-07-04 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.
Then, the next plan is on 2024-07-16 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.
Then the next plan is on 2024-07-26 which is more than 30 days since last valid plan, which was on 2024-06-21. so its valid. and so on...
I tried using lag window function, but it did not work as expected because it was considering the previous date only for comparison, whereas we need to compare each enroll_date with previous valid enroll_date.
WITH enrollments AS (
SELECT
customer_id,
enroll_date,
CASE
WHEN enroll_date - LAG(enroll_date) OVER (
PARTITION BY customer_id ORDER BY enroll_date
) <= 30 THEN
LAG(enroll_date) OVER (
PARTITION BY customer_id ORDER BY enroll_date
)
ELSE enroll_date
END
AS enroll_fill
FROM plan
where enroll_date is not null
)
SELECT
customer_id,
enroll_date,
enroll_fill,
CASE
WHEN enroll_date = enroll_fill THEN 'valid'
ELSE 'duplicate'
END AS type
FROM enrollments
But this wasn't giving the expected output.
| customer_id | enroll_date | expected enroll_fill | expected type |
|---|---|---|---|
| 1 | 2024-02-01 | 2024-02-01 | valid |
| 1 | 2024-02-15 | 2024-02-01 | duplicate |
| 1 | 2024-03-05 | 2024-03-05 | valid |
| 1 | 2024-05-21 | 2024-05-21 | valid |
| 1 | 2024-06-09 | 2024-05-21 | duplicate |
| 1 | 2024-06-21 | 2024-06-21 | valid |
| 1 | 2024-07-04 | 2024-06-21 | duplicate |
| 1 | 2024-07-16 | 2024-06-21 | duplicate |
| 1 | 2024-07-26 | 2024-07-26 | valid |
| 1 | 2024-08-03 | 2024-07-26 | duplicate |
| 1 | 2024-08-17 | 2024-07-26 | duplicate |
| 1 | 2024-09-05 | 2024-09-05 | valid |
The output given by above query:
| customer_id | enroll_date | output enroll_fill | output type |
|---|---|---|---|
| 1 | 2024-02-01 | 2024-02-01 | valid |
| 1 | 2024-02-15 | 2024-02-01 | duplicate |
| 1 | 2024-03-05 | 2024-03-05 | valid |
| 1 | 2024-05-21 | 2024-05-21 | valid |
| 1 | 2024-06-09 | 2024-05-21 | duplicate |
| 1 | 2024-06-21 | 2024-06-09 | duplicate |
| 1 | 2024-07-04 | 2024-06-21 | duplicate |
| 1 | 2024-07-16 | 2024-07-04 | duplicate |
| 1 | 2024-07-26 | 2024-07-16 | duplicate |
| 1 | 2024-08-03 | 2024-07-26 | duplicate |
| 1 | 2024-08-17 | 2024-08-03 | duplicate |
| 1 | 2024-09-05 | 2024-08-17 | duplicate |
Kindly suggest if there are any other ways to do this.