Python pandas column filtering substring

Question

I have a dataframe in python3 using pandas which has a column containing a string with a date.

This is the subset of the column

ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
"2020-04-08"
"2020-04-12"

I would like to remove the rows that have the same month and day twice and keep the one with the newest year.

This would be what I would expect as a result from this subset

ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"

The last two rows where removed because 2020-04-12 and 2020-04-08 already had the dates in 2021.

I thought of doing this with an apply and lambda but my real dataframe has hundreds of rows and tens of columns so it would not be efficient. Is there a more efficient way of doing this?

Richie Youm · Accepted Answer · 2020-08-27 19:41:02Z

1

There are a couple of ways you can do this. One of them would be to extract the year, sort it by year, and drop rows with duplicate month day pair.

# separate year and month-day pairs
df['year'] = df['ColA'].apply(lambda x: x[:4])
df['mo-day'] = df['ColA'].apply(lambda x: x[5:])
df.sort_values('year', inplace=True)
print(df)

This is what it would look like after separation and sorting:

         ColA  year mo-day
2  2020-04-12  2020  04-12
3  2020-04-08  2020  04-08
4  2020-04-12  2020  04-12
0  2021-04-03  2021  04-03
1  2021-04-08  2021  04-08

Afterwards, we can simply drop the duplicates and remove the additional columns:

# drop duplicate month-day pairs
df.drop_duplicates('mo-day', keep='first', inplace=True)

# get rid of the two columns
df.drop(['year','mo-day'], axis=1, inplace=True)

# since we dropped duplicate, reset the index
df.reset_index(drop=True, inplace=True)
print(df)

Final result:

         ColA
0  2020-04-12
1  2020-04-08
2  2021-04-03

This would be much faster than if you were to convert the entire column to datetime and extract dates, as you're working with the string as is.

edited Aug 27, 2020 at 19:41

answered Aug 27, 2020 at 19:20

Richie Youm

113 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Richie Youm Over a year ago

Also your expected result seems incorrect; you wanted to keep the oldest year, so you should be keeping 2020-04-08 instead of 2021-04-08, etc.

colla Over a year ago

Yes you are correct, i updated the question sorry about that...

SimonR · Accepted Answer · 2020-08-27 19:41:19Z

0

I'm not sure you can get away from using an 'apply' to extract the relevant part of the date for grouping, but this is much easier if you first convert that column to a pandas datetime type:

df = pd.DataFrame({'colA':
["2021-04-03",
"2021-04-08",
"2020-04-12",
"2020-04-08",
"2020-04-12"]})

df['colA'] = df.colA.apply(pd.to_datetime)

Then you can group by the (day, month) and keep the highest value like so:

df.groupby(df.colA.apply(lambda x: (x.day, x.month))).max()

edited Aug 27, 2020 at 19:41

answered Aug 27, 2020 at 19:27

SimonR

1,8241 gold badge7 silver badges12 bronze badges

2 Comments

colla Over a year ago

Yes, this is what I was looking for but unfortunately when I apply your answer on my dataframe only the dates of 2020 come out and I was looking to have the dates of 2021 as priority and then if it does not exist take the date of 2020 which must exist

colla Over a year ago

Your answer was correct it's just that instead of first() it should be max(). Maybe you can edit your post so that I can accept your answer

Collectives™ on Stack Overflow

Python pandas column filtering substring

2 Answers 2

2 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related