I have a Pandas DataFrame that has already been reduced to duplicates only and sorted. Duplicates are identified by the "HASH" column, and the frame is then sorted by "HASH" and "SIZE":
df_out = df.copy()
df_out['is_duplicated'] = df_out.duplicated(['HASH'], keep=False)  # keep=False: mark all duplicates as True
df_out = df_out.loc[df_out['is_duplicated']]  # keep only duplicate records (.ix is deprecated; use .loc)
df_out = df_out.sort_values(['HASH', 'SIZE'], ascending=[True, False])  # sort by "HASH", then by "SIZE" descending
Result:
HASH SIZE is_duplicated
1 5 TRUE
1 3 TRUE
1 2 TRUE
9 7 TRUE
9 5 TRUE
I would like to add two more columns. The first column would identify rows with the same "HASH" by an ID: the first set of rows sharing a "HASH" would be 1, the next set 2, and so on.
The second column would mark the single row in each group that has the largest "SIZE":
HASH SIZE ID KEEP
1 5 1 TRUE
1 3 1 FALSE
1 2 1 FALSE
9 7 2 TRUE
9 5 2 FALSE
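For reference, one way this could be done (a sketch on the sample data above, using `groupby` with `ngroup` for the ID and `idxmax` to pick one largest-"SIZE" row per group; the column names match the question, the rest is an assumption about the setup):

```python
import pandas as pd

# Sample data, already reduced to duplicates and sorted as in the question
df_out = pd.DataFrame({
    'HASH': [1, 1, 1, 9, 9],
    'SIZE': [5, 3, 2, 7, 5],
})

# ID: consecutive group number per "HASH" (ngroup is 0-based, so add 1)
df_out['ID'] = df_out.groupby('HASH').ngroup() + 1

# KEEP: True only for the single largest-"SIZE" row in each group.
# idxmax returns one index label per group, even when sizes tie.
keep_idx = df_out.groupby('HASH')['SIZE'].idxmax()
df_out['KEEP'] = df_out.index.isin(keep_idx)

print(df_out)
```

Because the frame is already sorted by "HASH", `ngroup` numbers the groups in the order they appear, matching the desired output.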