How to delete rows from dataframe based on condition

Question

I have the following dataframe with the columns "ID", "Month" and "Status". Status encodes churn: "Churn" = 1 and "Not Churn" = 2. I want to delete all rows for IDs that have already churned, except the first appearance of status 1. For example:

Dataframe

    ID      Month   Status
    2310    201708  2
    2310    201709  2
    2310    201710  1
    2310    201711  1
    2310    201712  1
    2310    201801  1
    2311    201704  2
    2311    201705  2
    2311    201706  2
    2311    201707  2
    2311    201708  2
    2311    201709  2
    2311    201710  1
    2311    201711  1
    2311    201712  1
    2312    201708  2
    2312    201709  2
    2312    201710  2
    2312    201711  1
    2312    201712  1
    2312    201801  1

After deleting I should have the following dataframe

    ID      Month   Status
    2310    201708  2
    2310    201709  2
    2310    201710  1

    2311    201704  2
    2311    201705  2
    2311    201706  2
    2311    201707  2
    2311    201708  2
    2311    201709  2
    2311    201710  1

    2312    201708  2
    2312    201709  2
    2312    201710  2
    2312    201711  1

I tried the following: first, find the minimum month for each customer ID where Status is 1:

    df1=df[df.Status==1].groupby('ID')['Month'].min()

Then, for each ID, I have to delete all status 1 rows whose Month is greater than that minimum value.
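One way to finish the attempt described above is to map each row's ID to its first churn month and drop the later status 1 rows. This is only a sketch: the frame is rebuilt from the sample in the question, and the names `first_churn`, `churn_month` and `result` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [2310] * 6 + [2311] * 9 + [2312] * 6,
    "Month": [201708, 201709, 201710, 201711, 201712, 201801,
              201704, 201705, 201706, 201707, 201708, 201709,
              201710, 201711, 201712,
              201708, 201709, 201710, 201711, 201712, 201801],
    "Status": [2, 2, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2, 1, 1, 1,
               2, 2, 2, 1, 1, 1],
})

# Earliest month with Status == 1 for each ID (the step from the question).
first_churn = df[df["Status"] == 1].groupby("ID")["Month"].min()

# Look up that month for every row; IDs that never churned get NaN.
churn_month = df["ID"].map(first_churn)

# A row is dropped when it has status 1 and is later than the first churn month.
# Comparisons against NaN are False, so never-churned IDs keep all their rows.
drop = df["Status"].eq(1) & df["Month"].gt(churn_month)
result = df[~drop]
print(result)
```

Because the comparison uses the Month values rather than row positions, this variant should not depend on the rows being sorted.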

Why do you have results for 2311 with status 2 when it changes to 1 later on? Shouldn't those get dropped?
– Umar.H, Commented Jan 26, 2020 at 20:46
I have to keep all rows until the first time the status changes to 1. So for each ID I keep all rows with value 2, plus the first row where the status changed to 1.
– zdz, Commented Jan 26, 2020 at 20:50
Please share some data in a way that makes it easy for others to test solutions.
– AMC, Commented Jan 27, 2020 at 0:47

dkhara · Accepted Answer · 2020-01-29 12:51:01Z

1

If you're familiar with idxmin, you can use it to get the index labels of the rows with the earliest month in each (ID, Status) group:

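import pandas as pd

# index labels of the earliest month for each (ID, Status) pair
min_idx = df.groupby(['ID','Status'])['Month'].idxmin().reset_index(drop=True)
# index labels of all status 2 rows
status2_idx = df[df['Status'].eq(2)].index.to_series()
# combine the labels (pd.concat replaces Series.append, which was removed in pandas 2.0)
idx = pd.concat([min_idx, status2_idx]).drop_duplicates()
# select by label with .loc (idxmin returns labels, not positions)
df_new = df.loc[idx].sort_index()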

print(df_new)                                                                        
      ID   Month  Status
0   2310  201708       2
1   2310  201709       2
2   2310  201710       1
6   2311  201704       2
7   2311  201705       2
8   2311  201706       2
9   2311  201707       2
10  2311  201708       2
11  2311  201709       2
12  2311  201710       1
15  2312  201708       2
16  2312  201709       2
17  2312  201710       2
18  2312  201711       1

Update

Or, you may think about using GroupBy.apply:

df1 = df.groupby(['ID','Status']).apply(lambda x: (x['Status'].eq(2)) | (x['Month'].eq(x['Month'].min())))
df1 = df1.reset_index(level=['ID','Status'], drop=True)
df_new = df.loc[df1]

print(df_new)                                                                                                                                              
      ID   Month  Status
0   2310  201708       2
1   2310  201709       2
2   2310  201710       1
6   2311  201704       2
7   2311  201705       2
8   2311  201706       2
9   2311  201707       2
10  2311  201708       2
11  2311  201709       2
12  2311  201710       1
15  2312  201708       2
16  2312  201709       2
17  2312  201710       2
18  2312  201711       1
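The same mask can probably be built without apply by broadcasting each group's minimum month back onto its rows with transform. The sketch below uses a trimmed two-ID version of the question's data, and the names `group_min` and `keep` are illustrative.

```python
import pandas as pd

# Trimmed version of the question's data (two IDs only), for illustration.
df = pd.DataFrame({
    "ID":     [2310, 2310, 2310, 2310, 2312, 2312, 2312, 2312],
    "Month":  [201708, 201709, 201710, 201711, 201710, 201711, 201712, 201801],
    "Status": [2, 2, 1, 1, 2, 1, 1, 1],
})

# Earliest month within each (ID, Status) group, repeated for every row of the group.
group_min = df.groupby(["ID", "Status"])["Month"].transform("min")

# Keep every status 2 row, plus the rows that sit at their group's earliest month.
keep = df["Status"].eq(2) | df["Month"].eq(group_min)
df_new = df[keep]
print(df_new)
```

Since transform returns a Series aligned with the original index, no reset_index step is needed before filtering.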

Update 2

However, if you simply want to remove all of the status 1 rows that come after the first one (the row with the earliest month), you could use sort_values and transform:

df = df.sort_values(by=['ID','Month']).reset_index(drop=True) 
df = df[df.groupby('ID')['Status'].transform(lambda x: ~(x.duplicated() & (x == 1)))]

print(df)                                                              
      ID   Month  Status
0   2310  201708       2
1   2310  201709       2
2   2310  201710       1
6   2311  201704       2
7   2311  201705       2
8   2311  201706       2
9   2311  201707       2
10  2311  201708       2
11  2311  201709       2
12  2311  201710       1
15  2312  201708       2
16  2312  201709       2
17  2312  201710       2
18  2312  201711       1
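The sort step matters because duplicated marks repeats in row order, not in month order. The sketch below illustrates this with a small, hypothetical shuffled frame; the helper name `drop_repeat_churn` is made up for the example, and the `astype(bool)` call is only a defensive addition.

```python
import pandas as pd

# Hypothetical shuffled input: one customer, rows not in month order.
df = pd.DataFrame({
    "ID":     [2310, 2310, 2310, 2310, 2310],
    "Month":  [201711, 201708, 201710, 201709, 201712],
    "Status": [1, 2, 1, 2, 1],
})

def drop_repeat_churn(frame):
    """Drop every status 1 row except the first one seen per ID (in row order)."""
    mask = frame.groupby("ID")["Status"].transform(
        lambda s: ~(s.duplicated() & (s == 1))
    )
    return frame[mask.astype(bool)]

# Without sorting, the first status 1 row in row order is month 201711,
# so the real first churn month (201710) is dropped instead.
unsorted_result = drop_repeat_churn(df)

# Sorting by ID and Month first makes "first seen" mean "earliest month".
sorted_result = drop_repeat_churn(
    df.sort_values(["ID", "Month"]).reset_index(drop=True)
)
print(sorted_result)
```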

edited Jan 29, 2020 at 12:51

answered Jan 26, 2020 at 21:18


10 Comments

zdz Over a year ago

It doesn't work on my real dataframe. The rows with status 1 are still there.

dkhara Over a year ago

Could you provide any info about those remaining status 1 rows? In your example, the IDs and months were already ordered. If they aren't ordered in your actual dataframe, you may not receive the expected output.

zdz Over a year ago

That's it! Thank you. Meanwhile, I found another issue with the Churn and Not Churn statuses. Situation 1: the customer was inactive (Status = 1) and then became active (Status = 2). I have to delete all status 1 rows for each customer that come before status 2. Situation 2: the customer was only in Status = 1 during the observed period. So I have to delete all status 1 rows for each customer that has no other status during the observed period.

zdz Over a year ago

Thanks, dkhara, you are right. You solved my problem. I opened a new question with the additional points: 'Deleting rows based on groupby conditions'.

zdz Over a year ago

Done. Could you also look at this second question? You can probably find the answer.


Scott Boston · Accepted Answer · 2020-01-28 16:31:50Z

1

IIUC, you can use groupby with transform and boolean logic, then boolean indexing (this assumes the rows are already ordered by Month within each ID):

df[df.groupby('ID')['Status'].transform(lambda x: ~(x.duplicated() & (x == 1)))]

Output:

      ID   Month  Status
0   2310  201708       2
1   2310  201709       2
2   2310  201710       1
6   2311  201704       2
7   2311  201705       2
8   2311  201706       2
9   2311  201707       2
10  2311  201708       2
11  2311  201709       2
12  2311  201710       1
15  2312  201708       2
16  2312  201709       2
17  2312  201710       2
18  2312  201711       1
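For what it's worth, the same filter can likely be written without a lambda by calling DataFrame.duplicated on the ID and Status columns. This is a sketch on a trimmed version of the question's data, not part of the original answer, and the names `repeat_churn` and `result` are illustrative. Like the transform version, it assumes the rows are ordered by Month within each ID.

```python
import pandas as pd

# Trimmed version of the question's data (two IDs only), already sorted by month.
df = pd.DataFrame({
    "ID":     [2311, 2311, 2311, 2311, 2312, 2312, 2312],
    "Month":  [201708, 201709, 201710, 201711, 201710, 201711, 201712],
    "Status": [2, 2, 1, 1, 2, 1, 1],
})

# A row is a repeated churn row when it has status 1 and the same
# (ID, Status) pair has already appeared earlier in the frame.
repeat_churn = df["Status"].eq(1) & df.duplicated(subset=["ID", "Status"])
result = df[~repeat_churn]
print(result)
```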

edited Jan 28, 2020 at 16:31

answered Jan 28, 2020 at 16:13
