Marking duplicates in a separate column using pandas

Question

Suppose I have the following pandas DataFrame, df1, loaded from an Excel file in a Jupyter notebook:

Name    ID      Password
A       User_1  PW_1
A       User_2  PW_2
A       User_3  PW_3
B       User_4  PW_4
B       User_5  PW_5
C       User_6  PW_6

I'd like to add a new column, called STAT, that walks through the Name column and, for every row, puts dup (for duplicate) in STAT if the previous cell in Name contained the same item; otherwise it should be left empty. In my example above, Users 2, 3, and 5 should have dup in the STAT column after my operation.

Here is my attempt. I add a new blank column called STAT using df1.insert, and then I run:

for index, name in enumerate(df1['Name']):
    if index > 0:
        if df1['Name'][index - 1] == name:
            df1.ix[index, 'STAT'] = 'dup'

This works fine, but I'd like to know

a) if it can be improved

and more importantly

b) why it raises the warning "A value is trying to be set on a copy of a slice from a DataFrame" despite my using .ix. Even .loc raises the warning.

It would ordinarily be easy to check, but I'm using Jupyter Notebook in PyCharm, and every time I reload the file I get an "'_xsrf' argument missing from POST" error.

Relevant snippet of code, applied to my actual example. df names will differ:

sort_full = full_set.sort_values(['Name', 'SRC'])
dupless_full = sort_full.drop_duplicates(subset=['Name', 'ER', 'ID', 'PW'],
                                         keep='last')
dupless_full.reset_index(drop=True, inplace=True)

dupless_full['STAT'] = np.where(dupless_full['Name'] == dupless_full['Name'].shift(),
                                'dup', '')

Yes. In fact, they are sorted by another column, called SRC, taking values A or B, after they have been sorted by name. I chose not to include that information.
– Johnny Apple, Commented Sep 18, 2017 at 22:02

Vaishali · Accepted Answer · 2017-09-18 04:56:21Z

4

You can use np.where

df1['Stat'] = np.where(df1['Name'] == df1['Name'].shift(), 'Dupe', np.nan)

    Name    ID      Password    Stat
0   A       User_1  PW_1        nan
1   A       User_2  PW_2        Dupe
2   B       User_3  PW_3        nan
3   C       User_4  PW_4        nan
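
One detail about the output above: mixing a string with np.nan inside np.where produces the literal string 'nan' rather than a real missing value, which is why the column prints nan. Below is a minimal sketch of the same shift() comparison that fills with an empty string instead, as the question asks. The frame is rebuilt by hand from the question's six rows, which is an assumption, since the original is read from an Excel file.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the question's df1 (the real one comes from Excel).
df1 = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "C"],
    "ID": ["User_1", "User_2", "User_3", "User_4", "User_5", "User_6"],
    "Password": ["PW_1", "PW_2", "PW_3", "PW_4", "PW_5", "PW_6"],
})

# True where a row's Name equals the Name in the row directly above it.
# shift() puts NaN in the first row, so the first row is never a match.
same_as_previous = df1["Name"].eq(df1["Name"].shift())

# Both branches are strings, so no 'nan' text ends up in the column.
df1["STAT"] = np.where(same_as_previous, "dup", "")
print(df1)
```

With this input, the rows for User_2, User_3 and User_5 get dup and the rest stay empty.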


7 Comments

Johnny Apple Over a year ago

Well, again, this works, and is faster than my solution, but it still throws the same warning.

Vaishali Over a year ago

How did you generate the dataframe? If it's not generated using .copy(), it can lead to the copy warning.

Johnny Apple Over a year ago

I'm not using .copy(), because I was told that was never the correct way to generate a data frame. Is that nonsense?

Vaishali Over a year ago

Can you post the code that you are using to create the dataframe?

Johnny Apple Over a year ago

Tacked it on to the end.
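
A likely explanation for the warning, assuming an older pandas without copy-on-write: drop_duplicates returns a filtered frame that pandas still tracks as a possible copy of sort_full, so assigning a new column to it triggers the check no matter which indexer is used. The sketch below takes an explicit .copy() before assigning. The sample data is invented for illustration; only the column names come from the question's snippet.

```python
import numpy as np
import pandas as pd

# Invented sample data; column names follow the question's snippet.
full_set = pd.DataFrame({
    "Name": ["B", "A", "A", "A", "C"],
    "SRC":  ["A", "B", "A", "A", "A"],
    "ER":   ["e1", "e1", "e1", "e1", "e1"],
    "ID":   ["u4", "u2", "u1", "u1", "u6"],
    "PW":   ["p4", "p2", "p1", "p1", "p6"],
})

sort_full = full_set.sort_values(["Name", "SRC"])

# Chain the steps and finish with .copy() so the result is an independent
# frame rather than something derived from sort_full.
dupless_full = (
    sort_full
    .drop_duplicates(subset=["Name", "ER", "ID", "PW"], keep="last")
    .reset_index(drop=True)
    .copy()
)

# Adding the column no longer touches anything tracked as a slice.
dupless_full["STAT"] = np.where(
    dupless_full["Name"] == dupless_full["Name"].shift(), "dup", ""
)
print(dupless_full)
```

Using .copy() here is not a way of "generating" a DataFrame; it only states that the de-duplicated frame should stand on its own.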


jezrael · Accepted Answer · 2017-09-18 05:21:45Z

0

If the values in column Name are sorted, it is possible to use duplicated to build a boolean mask:

df1['Stat'] = np.where(df1['Name'].duplicated(), 'Dupe', '')
print (df1)
  Name      ID Password  Stat
0    A  User_1     PW_1      
1    A  User_2     PW_2  Dupe
2    B  User_3     PW_3      
3    C  User_4     PW_4

If the values are not sorted, the two approaches give different results. Here is a comparison with the other answer:

df1['Stat_shift'] = np.where(df1['Name'] == df1['Name'].shift(), 'Dupe', np.nan)
df1['Stat_duplicated'] = np.where(df1['Name'].duplicated(), 'Dupe', '')
print (df1)
  Name      ID Password Stat_shift Stat_duplicated
0    A  User_1     PW_1        nan                
1    A  User_2     PW_2       Dupe            Dupe
2    B  User_3     PW_3        nan                
3    A  User_2     PW_2        nan            Dupe
4    C  User_4     PW_4        nan                
5    B  User_3     PW_3        nan            Dupe
6    B  User_3     PW_3       Dupe            Dupe

edited Sep 18, 2017 at 5:21

answered Sep 18, 2017 at 5:03


6 Comments

Zero Over a year ago

I think, OP wants to check only previous value, not all duplicates in frame.

jezrael Over a year ago

It is possible, but if the values are sorted this should work too.

Zero Over a year ago

I meant, only if values are sorted, this'll work, but I'm not sure if OP stated that anywhere.

jezrael Over a year ago

Yes, exactly. So I add this to answer.

Johnny Apple Over a year ago

I want to list all duplicates as 'dup' except the first one. Let me edit my example, to clarify.
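
The point made in these comments can be checked with a small sketch: on unsorted names, comparing with shift() and calling duplicated() disagree, but after sorting they mark the same rows. The names below reuse the Name column from the unsorted comparison in the answer; the variable names are made up for illustration.

```python
import pandas as pd

# Name column from the unsorted example in the answer.
names = pd.Series(["A", "A", "B", "A", "C", "B", "B"], name="Name")

# "Same as the row directly above" versus "seen anywhere earlier".
prev_equal = names.eq(names.shift())
seen_before = names.duplicated()  # keep='first' by default

# After a stable sort, equal names sit next to each other.
sorted_names = names.sort_values(kind="stable").reset_index(drop=True)
sorted_prev_equal = sorted_names.eq(sorted_names.shift())
sorted_seen_before = sorted_names.duplicated()

print(prev_equal.tolist() == seen_before.tolist())
print(sorted_prev_equal.tolist() == sorted_seen_before.tolist())
```

So if the goal is to mark every repeat except the first, duplicated() does that regardless of row order, while the shift() comparison relies on the data being sorted by Name.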
