I have a Pandas dataframe that has already been reduced to duplicates only and sorted. Duplicates are identified by column "HASH", and the rows are then sorted by "HASH" and "SIZE".

df_out['is_duplicated'] = df.duplicated(['HASH'], keep=False) #keep=False: mark all duplicates as true
df_out = df_out.loc[df_out['is_duplicated']] #Keep only duplicate records (.ix was removed in pandas 1.0)
df_out = df_out.sort_values(['HASH', 'SIZE'], ascending=[True, False]) #Sort by "HASH", then by "SIZE"

Result:

HASH  SIZE  is_duplicated
1      5     TRUE
1      3     TRUE
1      2     TRUE
9      7     TRUE
9      5     TRUE

I would like to add 2 more columns. The first column would identify rows with the same "HASH" by an ID: the first set of rows with the same "HASH" would be 1, the next set would be 2, and so on.

The second column would mark the single row in each group that has the largest "SIZE".

HASH  SIZE ID   KEEP
1      5   1    TRUE
1      3   1    FALSE
1      2   1    FALSE
9      7   2    TRUE
9      5   2    FALSE

1 Answer

Perhaps use dicts and list comprehension:

import pandas as pd
df = pd.DataFrame([[1,1,1,9,9],[5,3,2,7,5]]).T
df.columns = ['HASH','SIZE']

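# Map each distinct HASH to a sequential ID (1, 2, ...) in order of first appearance
hash_dict = dict(zip(df.HASH.unique(), range(1, df.HASH.nunique() + 1)))
df['ID'] = [hash_dict[k] for k in df.HASH]

# Flag rows whose SIZE equals the largest SIZE in their HASH group (tied rows are all flagged)
max_dict = dict(df.groupby('HASH')['SIZE'].max())
df['KEEP'] = [b == max_dict[a] for a, b in zip(df.HASH, df.SIZE)]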
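
If two rows in a group tie for the largest "SIZE", comparing against the group maximum flags both of them, whereas the question asks for a single row per group. A possible alternative, sketched here with the sample data from the question and assuming the dataframe has a unique index, is to use groupby with ngroup() for the ID and idxmax() for the KEEP flag:

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"HASH": [1, 1, 1, 9, 9], "SIZE": [5, 3, 2, 7, 5]})

# ngroup() numbers the groups 0, 1, 2, ... in sorted HASH order; add 1 to start at 1
df["ID"] = df.groupby("HASH").ngroup() + 1

# idxmax() returns the index label of the first largest SIZE in each group,
# so exactly one row per group is marked (assumes index labels are unique)
df["KEEP"] = df.index.isin(df.groupby("HASH")["SIZE"].idxmax())

print(df)
```

Note that ngroup() numbers groups by sorted key rather than by first appearance; the two orders coincide here because the data is already sorted by "HASH".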