1

I have a dataframe in python3 using pandas which has a column containing a string with a date.

This is the subset of the column

ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
"2020-04-08"
"2020-04-12"

I would like to remove the rows that have the same month and day twice and keep the one with the newest year.

This would be what I would expect as a result from this subset

ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"

The last two rows where removed because 2020-04-12 and 2020-04-08 already had the dates in 2021.

I thought of doing this with an apply and lambda but my real dataframe has hundreds of rows and tens of columns so it would not be efficient. Is there a more efficient way of doing this?

2 Answers 2

1

There are a couple of ways you can do this. One of them would be to extract the year, sort it by year, and drop rows with duplicate month day pair.

# separate year and month-day pairs
df['year'] = df['ColA'].apply(lambda x: x[:4])
df['mo-day'] = df['ColA'].apply(lambda x: x[5:])
df.sort_values('year', inplace=True)
print(df)

This is what it would look like after separation and sorting:

         ColA  year mo-day
2  2020-04-12  2020  04-12
3  2020-04-08  2020  04-08
4  2020-04-12  2020  04-12
0  2021-04-03  2021  04-03
1  2021-04-08  2021  04-08

Afterwards, we can simply drop the duplicates and remove the additional columns:

# drop duplicate month-day pairs
df.drop_duplicates('mo-day', keep='first', inplace=True)

# get rid of the two columns
df.drop(['year','mo-day'], axis=1, inplace=True)

# since we dropped duplicate, reset the index
df.reset_index(drop=True, inplace=True)
print(df)

Final result:

         ColA
0  2020-04-12
1  2020-04-08
2  2021-04-03

This would be much faster than if you were to convert the entire column to datetime and extract dates, as you're working with the string as is.

Sign up to request clarification or add additional context in comments.

2 Comments

Also your expected result seems incorrect; you wanted to keep the oldest year, so you should be keeping 2020-04-08 instead of 2021-04-08, etc.
Yes you are correct, i updated the question sorry about that...
0

I'm not sure you can get away from using an 'apply' to extract the relevant part of the date for grouping, but this is much easier if you first convert that column to a pandas datetime type:

df = pd.DataFrame({'colA':
["2021-04-03",
"2021-04-08",
"2020-04-12",
"2020-04-08",
"2020-04-12"]})

df['colA'] = df.colA.apply(pd.to_datetime)

Then you can group by the (day, month) and keep the highest value like so:

df.groupby(df.colA.apply(lambda x: (x.day, x.month))).max()

2 Comments

Yes, this is what I was looking for but unfortunately when I apply your answer on my dataframe only the dates of 2020 come out and I was looking to have the dates of 2021 as priority and then if it does not exist take the date of 2020 which must exist
Your answer was correct it's just that instead of first() it should be max(). Maybe you can edit your post so that I can accept your answer

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.