removing duplicate rows in pandas DataFrame based on a condition

Question

I want to delete duplicate rows with respect to column 'a' in a DataFrame, using the argument take_last=True unless some condition holds. For instance, if I had the following DataFrame:

 a | b | c
 1 | S | Blue 
 2 | M | Black
 2 | L | Blue
 1 | L | Green

I want to drop duplicate rows with respect to column 'a', with take_last=True as the general rule, unless some condition holds, say c = 'Blue', in which case I want take_last=False.

The resulting df should be:

 a | b | c
 1 | L | Green
 2 | M | Black

I don't understand. Why does take_last=True mean take only Green?
– pacholik, Commented Oct 7, 2015 at 15:11
That's not what I was trying to do. I have edited the question now. I was just trying to give an example of my situation: I want to retain the last duplicate row unless some condition is True.
– Rakesh Adhikesavan, Commented Oct 7, 2015 at 15:16
Okay, now you need to explain your condition. "c = 'Blue'" is a condition which applies to a row, not a group. If I assemble all the a = 1 rows into a group, how do I determine whether you want the last or the first? Do you want the last unless any of the rows in the group have c = Blue?
– DSM, Commented Oct 7, 2015 at 15:24
I don't understand the example resulting df. From your description, I assumed you want to keep the last row with a given value of a, except if there are rows where c is 'Blue', in which case you'd like to keep the first one of those rows, but that's not what the example df shows.
– vmg, Commented Oct 7, 2015 at 15:26

jezrael · Accepted Answer · 2015-10-07 21:17:12Z

import pandas as pd

# sample data from the question
df = pd.DataFrame({'a': [1, 2, 2, 1],
                   'b': ['S', 'M', 'L', 'L'],
                   'c': ['Blue', 'Black', 'Blue', 'Green']})
print(df)
#   a  b      c
#0  1  S   Blue
#1  2  M  Black
#2  2  L   Blue
#3  1  L  Green

#get first rows of groups, sort them by 'a' and reset the index (drop=True discards the old index)
#sort_values replaces the removed DataFrame.sort
df1 = df.groupby('a').head(1).sort_values('a').reset_index(drop=True)

#get last rows of groups, sort them by 'a' and reset the index
df2 = df.groupby('a').tail(1).sort_values('a').reset_index(drop=True)
print(df1)
#   a  b      c
#0  1  S   Blue
#1  2  M  Black
print(df2)
#   a  b      c
#0  1  L  Green
#1  2  L   Blue

#if value in col c in df1 is 'Blue', replace this row with the row from df2 (indexes are the same)
df1.loc[df1['c'].isin(['Blue'])] = df2
print(df1)
#   a  b      c
#0  1  L  Green
#1  2  M  Black
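
Note that the code above keeps the first row of each group and switches to the last row when the first one is 'Blue'. The question words the rule the other way round (keep the last row unless it is 'Blue', then keep the first); both readings give the same result on the sample data. As a rough sketch of the rule as worded in the question, assuming a pandas version where drop_duplicates takes keep= (it replaced take_last in pandas 0.17), something like the following should work. The function name drop_duplicates_conditionally and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": [1, 2, 2, 1],
    "b": ["S", "M", "L", "L"],
    "c": ["Blue", "Black", "Blue", "Green"],
})


def drop_duplicates_conditionally(frame, key="a", col="c", value="Blue"):
    """Hypothetical helper: keep the last row per `key`, unless that
    last row has `col == value`, in which case keep the first row."""
    first = frame.drop_duplicates(subset=key, keep="first")
    last = frame.drop_duplicates(subset=key, keep="last")

    # Keys whose last row meets the condition (assumed reading of the rule).
    flagged = last.loc[last[col] == value, key]

    result = pd.concat([
        last[~last[key].isin(flagged)],   # general rule: keep the last row
        first[first[key].isin(flagged)],  # exception: keep the first row
    ])
    return result.sort_values(key).reset_index(drop=True)


result = drop_duplicates_conditionally(df)
print(result)
```

If the condition should instead apply when any row in the group is 'Blue' (as DSM asks in the comments), only the line that computes flagged would need to change.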
