
I have a DataFrame that looks like this:

prod_id, prod_name, col_1, col_2, type
101, electronic, 10, 10, old
102, hardware, 2, 4, old
101, electronic, 10, 10, new
102, hardware, 2, 1, new
103, other, 22, 13, new

I am trying to update my DataFrame so that, for each product, it keeps the row with type=old when all other columns are the same, and otherwise uses the values from the type=new row.

Final output:

prod_id, prod_name, col_1, col_2, type
101, electronic, 10, 10, old
102, hardware, 2, 1, new
103, other, 22, 13, new
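For reference, the sample frame can be built like this (column names as above):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'prod_id':   [101, 102, 101, 102, 103],
    'prod_name': ['electronic', 'hardware', 'electronic', 'hardware', 'other'],
    'col_1':     [10, 2, 10, 2, 22],
    'col_2':     [10, 4, 10, 1, 13],
    'type':      ['old', 'old', 'new', 'new', 'new'],
})
print(df)
```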
  • did you look at the Dataframe.drop_duplicates method? It takes a subset parameter Commented Aug 18, 2020 at 14:08
  • @PaulH, I did check out the drop_duplicates method. What I'm having trouble with is applying drop_duplicates based on values in a column Commented Aug 18, 2020 at 14:10
  • did you read about the subset parameter? Commented Aug 18, 2020 at 14:10
  • also, check your example output. you have a "new" electronic row but you included the "old" row, which seems to contradict your problem statement Commented Aug 18, 2020 at 14:15
  • @PaulH, maybe I did not put this correctly in the first place. If all rows (except type) are the same, I would prefer to take the first occurrence. If the values in any of the columns mismatch, I would like to take the latest row. On the other question, df.drop_duplicates(subset=['col_1','col_2']) would perform the duplicate elimination, but I am trying to check the type column before applying drop_duplicates Commented Aug 18, 2020 at 14:18

3 Answers


From what I understand, you can try two boolean masks: one keeping rows that are duplicated (ignoring type) and have type='old', and another keeping type='new' rows that are not duplicated:

u = df.drop(columns="type")            # compare everything except 'type'
c = ((u.duplicated(keep=False) & df['type'].eq('old')) |
     (df['type'].eq('new') & ~u.duplicated(keep=False)))
out = df[c].copy()

   prod_id   prod_name  col_1  col_2 type
0      101  electronic     10     10  old
3      102    hardware      2      1  new
4      103       other     22     13  new
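For what it's worth, the masks hinge on duplicated(keep=False), which flags every member of a duplicate group rather than just the repeats. A self-contained run on the sample data (frame rebuilt here for convenience):

```python
import pandas as pd

df = pd.DataFrame({
    'prod_id':   [101, 102, 101, 102, 103],
    'prod_name': ['electronic', 'hardware', 'electronic', 'hardware', 'other'],
    'col_1':     [10, 2, 10, 2, 22],
    'col_2':     [10, 4, 10, 1, 13],
    'type':      ['old', 'old', 'new', 'new', 'new'],
})

u = df.drop(columns='type')            # compare everything except 'type'
dup = u.duplicated(keep=False)         # True for both 101 rows, False elsewhere
c = (dup & df['type'].eq('old')) | (df['type'].eq('new') & ~dup)
out = df[c].copy()
print(out)
```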

As I see it, you want the result to contain a single row for each prod_id (more precisely, the last row of each group).

The type column should stay 'old' only when the group's rows are identical in every column except type; otherwise it becomes 'new'.

To get this result, define the following function:

def grpRes(grp):
    res = grp.iloc[-1].copy()          # last row of the group
    # 'old' only if an 'old' row exists and all rows match apart from 'type'
    same = grp.drop(columns='type').drop_duplicates().shape[0] == 1
    res['type'] = 'old' if same and grp['type'].eq('old').any() else 'new'
    return res

Then apply this function to each group:

result = df.groupby('prod_id').apply(grpRes).reset_index(drop=True)

The result is:

   prod_id   prod_name  col_1  col_2 type
0      101  electronic     10     10  old
1      102    hardware      2      1  new
2      103       other     22     13  new
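A self-contained run of the groupby route on the sample data; this sketch compares whole rows (via drop_duplicates) rather than individual cell values, so it also works when col_1 and col_2 legitimately differ within a matching pair:

```python
import pandas as pd

df = pd.DataFrame({
    'prod_id':   [101, 102, 101, 102, 103],
    'prod_name': ['electronic', 'hardware', 'electronic', 'hardware', 'other'],
    'col_1':     [10, 2, 10, 2, 22],
    'col_2':     [10, 4, 10, 1, 13],
    'type':      ['old', 'old', 'new', 'new', 'new'],
})

def grp_res(grp):
    res = grp.iloc[-1].copy()          # last row of the group
    # 'old' only if an 'old' row exists and all rows match apart from 'type'
    same = grp.drop(columns='type').drop_duplicates().shape[0] == 1
    res['type'] = 'old' if same and grp['type'].eq('old').any() else 'new'
    return res

result = df.groupby('prod_id').apply(grp_res).reset_index(drop=True)
print(result)
```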


There is a simple solution, provided the type='old' row always comes before its type='new' counterpart within each group:

columns = list(df.columns)
columns.remove('type')
# first drop exact duplicates, keeping the earlier ('old') row
df = df.drop_duplicates(subset=columns, keep='first')
# then keep the most recent remaining row for each product
df = df.drop_duplicates(subset=['prod_id'], keep='last')
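A self-contained check (frame rebuilt inline). Note that a second pass keyed on prod_id is needed, because the old and new rows of prod_id 102 differ and therefore both survive the first drop_duplicates:

```python
import pandas as pd

df = pd.DataFrame({
    'prod_id':   [101, 102, 101, 102, 103],
    'prod_name': ['electronic', 'hardware', 'electronic', 'hardware', 'other'],
    'col_1':     [10, 2, 10, 2, 22],
    'col_2':     [10, 4, 10, 1, 13],
    'type':      ['old', 'old', 'new', 'new', 'new'],
})

cols = [c for c in df.columns if c != 'type']
step1 = df.drop_duplicates(subset=cols, keep='first')         # drops the duplicate 'new' 101 row
out = step1.drop_duplicates(subset=['prod_id'], keep='last')  # drops the stale 'old' 102 row
print(out)
```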
