Identify duplicates within a period of time using Redshift SQL

Question

In a table, I have plan details of customers with their customer_id and enroll_date.

Now, I want to identify duplicate and valid enrollments from the overall data.

Duplicate: If a customer enrolls a plan, and then enrolls for another plan within 30 (<=30) days, then the second plan is considered as duplicate. Others are valid.

Example:

customer_id	enroll_date	type
2	2024-02-05	valid
2	2024-02-25	duplicate
1	2024-03-05	valid
1	2024-05-21	valid
1	2024-06-09	duplicate
1	2024-06-21	valid
1	2024-07-04	duplicate
1	2024-07-16	duplicate
1	2024-07-26	valid
1	2024-08-03	duplicate
1	2024-08-17	duplicate
1	2024-09-05	valid

Explanation:

Consider customer with customer_id = 1

The plan enrolled on 2024-03-05 is valid since its the first plan made by that customer.

Then, the next plan that is enrolled by the same customer is on 2024-05-21. Since its after 30 days since plan on 2024-03-05, it is valid.

Then, the next plan is on 2024-06-09 which is less than 30 days after the valid plan, which was on 2024-05-21. so it is a duplicate.

Then, the next plan is on 2024-06-21 which is more than 30 days since last valid plan, which was on 2024-05-21. so its valid.

Then, the next plan is on 2024-07-04 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.

Then, the next plan is on 2024-07-16 which is less than 30 days after the valid plan, which was on 2024-06-21. so it is a duplicate.

Then the next plan is on 2024-07-26 which is more than 30 days since last valid plan, which was on 2024-06-21. so its valid. and so on...

I tried using lag window function, but it did not work as expected because it was considering the previous date only for comparison, whereas we need to compare each enroll_date with previous valid enroll_date.

WITH enrollments AS (
    SELECT 
        customer_id, 
        enroll_date,
        CASE 
            WHEN enroll_date - LAG(enroll_date) OVER (
                PARTITION BY customer_id ORDER BY enroll_date
            ) <= 30 THEN 
                LAG(enroll_date) OVER (
                    PARTITION BY customer_id ORDER BY enroll_date
                )
            ELSE enroll_date
        END 
        AS enroll_fill
    FROM plan
    where enroll_date is not null
)

SELECT 
    customer_id, 
    enroll_date, 
    enroll_fill,
    CASE 
        WHEN enroll_date = enroll_fill THEN 'valid' 
        ELSE 'duplicate' 
    END AS type
FROM enrollments

But this wasn't giving the expected output.

customer_id	enroll_date	expected enroll_fill	expected type
1	2024-02-01	2024-02-01	valid
1	2024-02-15	2024-02-01	duplicate
1	2024-03-05	2024-03-05	valid
1	2024-05-21	2024-05-21	valid
1	2024-06-09	2024-05-21	duplicate
1	2024-06-21	2024-06-21	valid
1	2024-07-04	2024-06-21	duplicate
1	2024-07-16	2024-06-21	duplicate
1	2024-07-26	2024-07-26	valid
1	2024-08-03	2024-07-26	duplicate
1	2024-08-17	2024-07-26	duplicate
1	2024-09-05	2024-09-05	valid

The output given by above query:

customer_id	enroll_date	output enroll_fill	output type
1	2024-02-01	2024-02-01	valid
1	2024-02-15	2024-02-01	duplicate
1	2024-03-05	2024-03-05	valid
1	2024-05-21	2024-05-21	valid
1	2024-06-09	2024-05-21	duplicate
1	2024-06-21	2024-06-09	duplicate
1	2024-07-04	2024-06-21	duplicate
1	2024-07-16	2024-07-04	duplicate
1	2024-07-26	2024-07-16	duplicate
1	2024-08-03	2024-07-26	duplicate
1	2024-08-17	2024-08-03	duplicate
1	2024-09-05	2024-08-17	duplicate

Kindly suggest if there are any other ways to do this.

You need a recursive CTE here.

Tim Biegeleisen
– Tim Biegeleisen

2025-03-11 12:40:16 +00:00
Commented Mar 11 at 12:40 — Tim Biegeleisen
– Tim Biegeleisen, Commented Mar 11 at 12:40

Lakshmi Sruthi K · Accepted Answer · 2025-03-12 18:08:59Z

Similar to the answers suggested, I created my own solution to this which I'm posting here for anyone looking for the answer.

with recursive 
    enrollments as (
        select 
            plan.customer_id,  
            plan.enroll_date,
            row_number() over (
                partition by plan.customer_id order by plan.enroll_date) 
            as rn
        from plan
    ),
    classified_enrollments(customer_id, enroll_date, rn, enroll_fill, enroll_type) 
    as (
        select 
            customer_id, 
            enroll_date, 
            rn, 
            enroll_date as enroll_fill,
            'valid'::text as enroll_type --first enroll is always valid
        from enrollments
        where rn = 1
        
        union all
        
        select 
            r.customer_id, 
            d.enroll_date, 
            d.rn, 
            case when d.enroll_date <= date(r.enroll_fill + interval '30d') 
                then r.enroll_fill
                else d.enroll_date
            end as enroll_fill,
            case when d.enroll_date is null then null
                when d.enroll_date <= date(r.enroll_fill + interval '30d') 
                then 'duplicate'::text
                else 'valid'::text
            end as enroll_type
        from classified_enrollments r
        join enrollments d on r.customer_id = d.customer_id and d.rn = r.rn + 1
    )
    SELECT
        customer_id, 
        enroll_date, 
        enroll_fill, 
        enroll_type
    FROM classified_enrollments;

Guillaume Outters · Accepted Answer · 2025-03-11 15:12:54Z

0

I answered a similar question for a PostgreSQL database.

I adapted it to your table names in a fiddle, it may need adaptations to run under Redshift (thus I don't post it as is here)
but it holds all the principles:

for each date, tell which subsequent date would be the next past a 30-day period.
identify sure starts of a span (those having no predecessor in the preceding 30 days)
then recurse over all potential span starts, marking the "sure starts" as valid then hoping to its successor indicated by step 1.

edited Mar 11 at 15:12

answered Mar 11 at 13:11

Guillaume Outters

7,5041 gold badge22 silver badges28 bronze badges

Collectives™ on Stack Overflow

Identify duplicates within a period of time using Redshift SQL

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related