I have a Pandas DataFrame that has already been reduced to duplicates only and sorted. Duplicates are identified by the "HASH" column, and the frame is then sorted by "HASH" and "SIZE":
df_out = df.copy()
df_out['is_duplicated'] = df_out.duplicated(['HASH'], keep=False)  # keep=False: mark all duplicates as True
df_out = df_out.loc[df_out['is_duplicated']]  # keep only duplicate records (.ix is deprecated; use .loc)
df_out = df_out.sort_values(['HASH', 'SIZE'], ascending=[True, False])  # sort by "HASH", then by "SIZE" descending
Result:
HASH SIZE is_duplicated
1 5 TRUE
1 3 TRUE
1 2 TRUE
9 7 TRUE
9 5 TRUE
I would like to add two more columns. The first column would identify rows with the same "HASH" by an ID: the first set of rows sharing a "HASH" would be 1, the next set 2, and so on.
The second column would mark the single row in each group that has the largest "SIZE":
HASH SIZE ID KEEP
1 5 1 TRUE
1 3 1 FALSE
1 2 1 FALSE
9 7 2 TRUE
9 5 2 FALSE
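For reference, one way this could be done (a sketch on the sample data above, using `groupby` with `ngroup` for the ID and `idxmax` to pick one largest-"SIZE" row per group; the column names match the question, the rest is an assumption about the setup):

```python
import pandas as pd

# Sample data, already reduced to duplicates and sorted as in the question
df_out = pd.DataFrame({
    'HASH': [1, 1, 1, 9, 9],
    'SIZE': [5, 3, 2, 7, 5],
})

# ID: consecutive group number per "HASH" (ngroup is 0-based, so add 1)
df_out['ID'] = df_out.groupby('HASH').ngroup() + 1

# KEEP: True only for the single largest-"SIZE" row in each group.
# idxmax returns one index label per group, even when sizes tie.
keep_idx = df_out.groupby('HASH')['SIZE'].idxmax()
df_out['KEEP'] = df_out.index.isin(keep_idx)

print(df_out)
```

Because the frame is already sorted by "HASH", `ngroup` numbers the groups in the order they appear, matching the desired output.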