2

In a table, I have plan details of customers with their customer_id and enroll_date.

Now, I want to identify duplicate and valid enrollments from the overall data.

Duplicate: If a customer enrolls a plan, and then enrolls for another plan within 30 (<=30) days, then the second plan is considered as duplicate. Others are valid.

Example:

customer_id enroll_date type
2 2024-02-05 valid
2 2024-02-25 duplicate
1 2024-03-05 valid
1 2024-05-21 valid
1 2024-06-09 duplicate
1 2024-06-21 valid
1 2024-07-04 duplicate
1 2024-07-16 duplicate
1 2024-07-26 valid
1 2024-08-03 duplicate
1 2024-08-17 duplicate
1 2024-09-05 valid

Explanation:

Consider customer with customer_id = 1

The plan enrolled on 2024-03-05 is valid since its the first plan made by that customer.

Then, the next plan that is enrolled by the same customer is on 2024-05-21. Since its after 30 days since plan on 2024-03-05, it is valid.

Then, the next plan is on 2024-06-09 which is less than 30 days after the valid plan, which was on 2024-05-21. so it is a duplicate.

Then, the next plan is on 2024-06-21 which is more than 30 days since last valid plan, which was on 2024-05-21. so its valid.

Then, the next plan is on 2024-07-04 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.

Then, the next plan is on 2024-07-16 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.

Then the next plan is on 2024-07-26 which is more than 30 days since last valid plan, which was on 2024-06-21. so its valid. and so on...

I tried using lag window function, but it did not work as expected because it was considering the previous date only for comparison, whereas we need to compare each enroll_date with previous valid enroll_date.

WITH enrollments AS (
    SELECT 
        customer_id, 
        enroll_date,
        CASE 
            WHEN enroll_date - LAG(enroll_date) OVER (
                PARTITION BY customer_id ORDER BY enroll_date
            ) <= 30 THEN 
                LAG(enroll_date) OVER (
                    PARTITION BY customer_id ORDER BY enroll_date
                )
            ELSE enroll_date
        END 
        AS enroll_fill
    FROM plan
    where enroll_date is not null
)

SELECT 
    customer_id, 
    enroll_date, 
    enroll_fill,
    CASE 
        WHEN enroll_date = enroll_fill THEN 'valid' 
        ELSE 'duplicate' 
    END AS type
FROM enrollments

But this wasn't giving the expected output.

customer_id enroll_date expected enroll_fill expected type
1 2024-02-01 2024-02-01 valid
1 2024-02-15 2024-02-01 duplicate
1 2024-03-05 2024-03-05 valid
1 2024-05-21 2024-05-21 valid
1 2024-06-09 2024-05-21 duplicate
1 2024-06-21 2024-06-21 valid
1 2024-07-04 2024-06-21 duplicate
1 2024-07-16 2024-06-21 duplicate
1 2024-07-26 2024-07-26 valid
1 2024-08-03 2024-07-26 duplicate
1 2024-08-17 2024-07-26 duplicate
1 2024-09-05 2024-09-05 valid

The output given by above query:

customer_id enroll_date output enroll_fill output type
1 2024-02-01 2024-02-01 valid
1 2024-02-15 2024-02-01 duplicate
1 2024-03-05 2024-03-05 valid
1 2024-05-21 2024-05-21 valid
1 2024-06-09 2024-05-21 duplicate
1 2024-06-21 2024-06-09 duplicate
1 2024-07-04 2024-06-21 duplicate
1 2024-07-16 2024-07-04 duplicate
1 2024-07-26 2024-07-16 duplicate
1 2024-08-03 2024-07-26 duplicate
1 2024-08-17 2024-08-03 duplicate
1 2024-09-05 2024-08-17 duplicate

Kindly suggest if there are any other ways to do this.

1
  • 1
    You need a recursive CTE here. Commented Mar 11 at 12:40

2 Answers 2

1

Similar to the answers suggested, I created my own solution to this which I'm posting here for anyone looking for the answer.

with recursive 
    enrollments as (
        select 
            plan.customer_id,  
            plan.enroll_date,
            row_number() over (
                partition by plan.customer_id order by plan.enroll_date) 
            as rn
        from plan
    ),
    classified_enrollments(customer_id, enroll_date, rn, enroll_fill, enroll_type) 
    as (
        select 
            customer_id, 
            enroll_date, 
            rn, 
            enroll_date as enroll_fill,
            'valid'::text as enroll_type --first enroll is always valid
        from enrollments
        where rn = 1
        
        union all
        
        select 
            r.customer_id, 
            d.enroll_date, 
            d.rn, 
            case when d.enroll_date <= date(r.enroll_fill + interval '30d') 
                then r.enroll_fill
                else d.enroll_date
            end as enroll_fill,
            case when d.enroll_date is null then null
                when d.enroll_date <= date(r.enroll_fill + interval '30d') 
                then 'duplicate'::text
                else 'valid'::text
            end as enroll_type
        from classified_enrollments r
        join enrollments d on r.customer_id = d.customer_id and d.rn = r.rn + 1
    )
    SELECT
        customer_id, 
        enroll_date, 
        enroll_fill, 
        enroll_type
    FROM classified_enrollments;
Sign up to request clarification or add additional context in comments.

Comments

0

I answered a similar question for a PostgreSQL database.

I adapted it to your table names in a fiddle, it may need adaptations to run under Redshift (thus I don't post it as is here)
but it holds all the principles:

  1. for each date, tell which subsequent date would be the next past a 30-day period.
  2. identify sure starts of a span (those having no predecessor in the preceding 30 days)
  3. then recurse over all potential span starts, marking the "sure starts" as valid then hoping to its successor indicated by step 1.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.