Extract duplicates into new dataframe with Pandas

Question

I have a large data frame with many columns. One of these columns is supposed to be a unique ID and another is a Year. Unfortunately, there are duplicates in the unique ID column.

I know how to generate a list of all duplicates, but what I actually want to do is extract them so that only the first entry (by year) remains for each ID. For example, the dataframe currently looks something like this (with a bunch of other columns):

ID    Year
----------
123   1213
123   1314
123   1516
154   1415
154   1718
233   1314
233   1415
233   1516

And what I want to do is transform this dataframe into:

ID    Year
----------
123   1213
154   1415
233   1314

While storing just those duplicates in another dataframe:

ID    Year
----------
123   1314
123   1516
154   1718
233   1415
233   1516

I could drop duplicates by year to keep the oldest entry, but I am not sure how to get just the duplicates into a list that I can store as another dataframe.

How would I do this?

Zero · Accepted Answer · 2018-08-27 19:11:00Z


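Use DataFrame.duplicated with keep='first'. It returns a boolean mask that is True for every row whose ID has already appeared earlier in the frame, so the negated mask selects the first entry per ID and the mask itself selects the remaining duplicates. This relies on the rows already being ordered by Year within each ID; if they are not, sort by Year first.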

In [187]: d = df.duplicated(subset=['ID'], keep='first')

In [188]: df[~d]
Out[188]:
    ID  Year
0  123  1213
3  154  1415
5  233  1314

In [189]: df[d]
Out[189]:
    ID  Year
1  123  1314
2  123  1516
4  154  1718
6  233  1415
7  233  1516
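
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is an assumption, pieced together from the two outputs above, and the names df_sorted, is_dup, first_entries and duplicates are only illustrative. It also sorts by ID and Year first, which the session above does not do, so that "first" really means the earliest year even if the rows arrive unordered.

```python
import pandas as pd

# Sample data reconstructed from the outputs shown above (an assumption,
# not the asker's real frame, which has many more columns).
df = pd.DataFrame({
    "ID":   [123, 123, 123, 154, 154, 233, 233, 233],
    "Year": [1213, 1314, 1516, 1415, 1718, 1314, 1415, 1516],
})

# Order rows so the earliest Year comes first within each ID.
df_sorted = df.sort_values(["ID", "Year"])

# True for every row whose ID was already seen earlier in df_sorted.
is_dup = df_sorted.duplicated(subset=["ID"], keep="first")

first_entries = df_sorted[~is_dup]  # one row per ID, earliest Year
duplicates = df_sorted[is_dup]      # every later row for a repeated ID

print(first_entries)
print(duplicates)
```

Both results keep the original index labels, so a row in duplicates can still be traced back to its position in df. Call reset_index(drop=True) on either frame if a fresh 0..n-1 index is preferred.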
