I have a large dataframe with many columns. One of these columns is supposed to be a unique ID and another is a Year. Unfortunately, there are duplicates in the ID column.

I know how to generate a list of all duplicates, but what I actually want to do is extract them out such that only the first entry (by year) remains. For example, the dataframe currently looks something like this (with a bunch of other columns):

ID    Year
----------
123   1213
123   1314
123   1516
154   1415
154   1718
233   1314
233   1415
233   1516

And what I want to do is transform this dataframe into:

ID    Year
----------
123   1213
154   1415
233   1314

While storing just those duplicates in another dataframe:

ID    Year
----------
123   1314
123   1516
154   1718
233   1415
233   1516

I could drop duplicates by year to keep the oldest entry, but I am not sure how to get just the duplicates into a list that I can store as another dataframe.

How would I do this?

1 Answer

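Use duplicated. With keep='first' it flags every row after the first occurrence of each ID, so the mask d selects the duplicates and ~d selects the rows to keep. This relies on the rows already being ordered by Year within each ID, as they are in your example: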

In [187]: d = df.duplicated(subset=['ID'], keep='first')

In [188]: df[~d]
Out[188]:
    ID  Year
0  123  1213
3  154  1415
5  233  1314

In [189]: df[d]
Out[189]:
    ID  Year
1  123  1314
2  123  1516
4  154  1718
6  233  1415
7  233  1516
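
If the rows are not guaranteed to be in Year order, one option is to sort before building the mask. The following is a minimal sketch rather than a definitive solution: the sample data is copied from the question, and the names df_sorted, is_dup, first_entries and duplicates are only illustrative. It also assumes that Year sorts correctly as a plain integer.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    "ID": [123, 123, 123, 154, 154, 233, 233, 233],
    "Year": [1213, 1314, 1516, 1415, 1718, 1314, 1415, 1516],
})

# Sort so that the earliest Year comes first within each ID
df_sorted = df.sort_values(["ID", "Year"])

# True for every row after the first occurrence of each ID
is_dup = df_sorted.duplicated(subset=["ID"], keep="first")

first_entries = df_sorted[~is_dup]  # one row per ID (earliest Year)
duplicates = df_sorted[is_dup]      # all remaining rows

print(first_entries)
print(duplicates)
```

Between them the two frames contain every original row exactly once, so nothing is lost by the split.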