1

I know there are lot of questions about removing duplicates from pandas dataframe but this is bit different.

I am trying to remove duplicates from the dataframe but not getting the actual output as in the below given result dataframe. Actually the data in table is too long. For understanding purpose I have given the dummy data here in the table.

Condition:-

I need to remove duplicates and get the rows that contain max value from diast column.

Is there a good way to get result dataframe using given df.

Any help would be appreciated. Thanks :)

DF:-

age syst diast a b c d
29 90 57 MO MO 0 MO
29 90 58 MO MO 0 MO
29 90 59 MO MO 0 MO
29 90 60 MO MO 0 MO
29 90 61 0 0 0 0
29 90 62 0 0 0 0
29 90 63 0 0 0 0
29 90 64 0 0 0 0
29 90 65 MO MO 0 MO
29 90 66 MO MO 0 MO
29 90 67 MO MO 0 MO
29 90 68 MO MO 0 MO

Result:-

age syst diast a b c d
29 90 60 MO MO 0 MO
29 90 64 0 0 0 0
29 90 68 MO MO 0 MO
5
  • what is the specific condition, please add that as part of your question Commented Oct 19, 2022 at 18:22
  • Please see the updated question. I need to get the result dataframe using given df. Commented Oct 19, 2022 at 18:26
  • why you get the diast 60, first row? shouldn't that be removed too Commented Oct 19, 2022 at 18:28
  • No, that should not be removed. That's why I need a help how to do this. Commented Oct 19, 2022 at 18:31
  • posted the answer, does it help? Commented Oct 19, 2022 at 18:51

2 Answers 2

1

can you try this:

df['id']=df.groupby(['age', 'syst', 'a', 'b', 'c', 'd']).ngroup()
df['id2']=df['id'].shift(-1)

df2=df.drop_duplicates(subset=['age', 'syst', 'a', 'b', 'c', 'd','id','id2'],keep=False).drop(['id','id2'],axis=1)
print(df2)
'''
    age  syst  diast   a   b  c   d
3    29    90     60  MO  MO  0  MO
7    29    90     64   0   0  0   0
11   29    90     68  MO  MO  0  MO
'''
Sign up to request clarification or add additional context in comments.

3 Comments

First, I removed the diast column and grouped the remaining columns among themselves and gave each group an id. In your example, we have 3 groups. I then assign the value of these ids from the previous row in a new column. If id and id2 are different, it means that it has moved to a new group. Finally, we left all the same values ​​except for diast. As a result, we got the expected output. NOTE: you may need to sort the diast column if different results come out.
What keep=False does?
its drop all duplicates.
1
# create a flag to separate out the group based on column 'a'
# a is the only column that distinguishes the group
df['flag'] = np.nan
df['flag']=df['flag'].mask(df['a'].ne(df['a'].shift()), 1).cumsum().ffill()

# sort, drop duplicates, keep flag as one of the column
# finally drop the flag column
(df.sort_values(['age','syst','diast'])
 .drop_duplicates(subset=['age','syst', 'a','b','c','d','flag'], keep='last')
 .drop(columns='flag'))
    age     syst    diast   a   b   c   d
3    29     90         60   MO  MO  0   MO
7    29     90         64   0   0   0   0
11   29     90         68   MO  MO  0   MO

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.