Extracting dates in a pandas dataframe column using regex

Question

I have a data frame with a column Campaign whose values follow the format campaign name (start date - end date). I need to create 3 new columns by extracting the start and end dates:

start_date, end_date, days_between_start_and_end_date.

The issue is that the Campaign column values are not in a fixed format. For values like the one below, my code works well.

1. Season1 hero (18.02. -24.03.2021)

In my code snippet I extract the start date and end date from the Campaign column. As you can see, the start date doesn't have a year, so I add the year by comparing the month values.

import pandas as pd
import re
import datetime

# read csv file
df = pd.read_csv("report.csv")

# extract start and end dates from the 'Campaign' column
dates = df['Campaign'].str.extract(r'(\d+\.\d+)\.\s*-\s*(\d+\.\d+\.\d+)')
df['start_date'] = dates[0]
df['end_date'] = dates[1]

# convert start and end dates to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m')
df['end_date'] = pd.to_datetime(df['end_date'], format='%d.%m.%Y')

# Add year to start date
for index, row in df.iterrows():
    if pd.isna(row["start_date"]) or pd.isna(row["end_date"]):
        continue
    start_month = row["start_date"].month
    end_month = row["end_date"].month
    year = row["end_date"].year
    if start_month > end_month:
        year = year - 1
    dates_str = str(row["start_date"].strftime("%d.%m")) + "." + str(year)
    df.at[index, "start_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
    dates_str = str(row["end_date"].strftime("%d.%m")) + "." + str(row["end_date"].year)
    df.at[index, "end_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")

But I have many other column values where my regex fails and I get NaN values, for example:

1.  Sales is on (30.12.21-12.01.2022)
2.  Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)
3.  M SALE (19.04 - 04.05.2022)
4.  NEW SALE (29.12.2022-11.01.2023)
5.  Year End (18.12. - 12.01.2023)
6.  XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)

In all of the above examples, the date format is different.

Expected output:

start_date   end_date
2021-12-30   2022-01-12
2023-03-24   2023-03-30
2022-04-19   2022-05-04
2022-12-29   2023-01-11
2022-12-18   2023-01-12
2021-11-18   2021-12-08

Can someone please help me here?

cottontail · Accepted Answer · 2023-02-06 10:43:06Z

Since the datetimes in the data don't have a fixed format (some are dd.mm.yy, some are dd.mm.YYYY), it might be better if we apply a custom parser function that uses try-except. We can certainly do two conversions using pd.to_datetime and choose values using np.where etc. but it might not save any time given we need to do a lot of string manipulations beforehand.
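
For comparison, the two-conversion idea mentioned above could look roughly like the following. This is only a sketch: the sample strings are made up to mirror the question, and it assumes every value already carries a year.

```python
import pandas as pd

# hypothetical sample: date strings that already contain a year
raw = pd.Series(["29.12.2022", "30.12.21", "12.01.2023"])

# first pass: 4-digit years; anything that doesn't fit becomes NaT instead of raising
four_digit = pd.to_datetime(raw, format="%d.%m.%Y", errors="coerce")
# second pass: 2-digit years
two_digit = pd.to_datetime(raw, format="%d.%m.%y", errors="coerce")

# keep the first pass where it worked, otherwise fall back to the second
parsed = four_digit.fillna(two_digit)
```

It still leaves the year-less start dates unhandled, which is why the string manipulation discussed next is needed either way.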

To append the missing year for some rows, we would need several pandas string methods (str.count(), str.cat() etc.). Since those are not optimized, it's probably better to use Python string methods in a loop instead.

Also, iterrows() is very slow; a plain Python loop over the column values is much faster.
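
As a rough illustration of that point, a loop over the zipped column values avoids building a Series object for every row. The frame and column names below are invented for the example.

```python
import pandas as pd

# hypothetical frame; the column names are made up for this sketch
df_demo = pd.DataFrame({
    "start": ["18.02.", "24.03", "30.12.21"],
    "end": ["24.03.2021", "30.03.2023", "12.01.2022"],
})

# iterate over plain Python values instead of calling df_demo.iterrows()
labels = [f"{s} -> {e}" for s, e in zip(df_demo["start"], df_demo["end"])]
df_demo["label"] = labels
```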

pd.to_datetime converts each element into a datetime.datetime object anyway, so we can use datetime.strptime from the built-in datetime module to perform the conversions.

from datetime import datetime

def datetime_parser(date, end_date=None):
    # remove whitespace around the date
    date = date.strip()
    dmy = date.split('.')
    # if the start date doesn't have a year, take it from the end date
    if end_date and len(dmy) == 2:
        # 'dd.mm' (e.g. 24.03): append the end date's year
        date = f"{date}.{end_date.rsplit('.', 1)[1]}"
    elif end_date and not dmy[-1]:
        # 'dd.mm.' (e.g. 18.12.): the trailing dot leaves the year part empty
        edmy = end_date.split('.')
        if int(dmy[1]) > int(edmy[1]):
            # start month is later than end month, so the start is in the previous year
            date = f"{date}{int(edmy[-1])-1}"
        else:
            date = f"{date}{edmy[-1]}"
    try:
        # try 'dd.mm.YYYY' format (e.g. 29.12.2022) first
        return datetime.strptime(date, '%d.%m.%Y')
    except ValueError:
        # try 'dd.mm.yy' format (e.g. 30.12.21) if the above doesn't work out
        return datetime.strptime(date, '%d.%m.%y')

# extract dates into 2 columns (tentatively start and end dates)
splits = df['Campaign'].str.extract(r"\((.*?)-(.*?)\)").values.tolist()
# parse the dates
df[['start_date', 'end_date']] = [[datetime_parser(start, end), datetime_parser(end)] for start, end in splits]
# find difference
df['days_between_start_and_end_date'] = df['end_date'] - df['start_date']
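
The subtraction above yields a timedelta column. If a plain number of days is wanted instead, the .dt.days accessor is one way to get it. A small sketch with made-up dates:

```python
import pandas as pd

# hypothetical, already-parsed dates used only for illustration
demo = pd.DataFrame({
    "start_date": pd.to_datetime(["2021-12-30", "2023-03-24"]),
    "end_date": pd.to_datetime(["2022-01-12", "2023-03-30"]),
})

# the difference is a timedelta64 column; .dt.days turns it into integers
demo["days_between_start_and_end_date"] = (demo["end_date"] - demo["start_date"]).dt.days
```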

@sdave Edited the code again. It covers the additional nuances. Made the regex matching lazy and added a month conditional check.

Timeless · Accepted Answer · 2023-02-03 14:06:45Z

I would use a basic regex with extract and then slice the result:

ser = df["Campaign"].str.extract(r"\((.*)\)", expand=False)

# the start date is everything before the hyphen
start_date = ser.str.strip().str.split(r"\s*-\s*").str[0]

# the end date is the last 10 characters (dd.mm.YYYY)
end_date = ser.str.strip().str[-10:]
# or ser.str.strip().str.rsplit("-").str[-1]

NB: You can assign the Series start_date and end_date to create your two new columns.

Output:

start_date, end_date
(1.0      30.12.21            # <- start_date
 2.0         24.03
 3.0         19.04
 4.0    29.12.2022
 Name: Campaign, dtype: object,
 1.0    12.01.2022            # <- end_date
 2.0    30.03.2023
 3.0    04.05.2022
 4.0    11.01.2023
 Name: Campaign, dtype: object)
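
To turn such strings into real dates, one possible approach is to borrow the missing year from the end date and step back one year when the start would otherwise fall after the end. This is a sketch under assumptions: the two Series below are typed in by hand from the question's examples, every end date is taken to be dd.mm.YYYY, 2-digit years are assumed to be in the 2000s, and the variable names are only illustrative.

```python
import pandas as pd

# hand-typed stand-ins for the extracted strings (assumption: end dates are dd.mm.YYYY)
start_raw = pd.Series(["30.12.21", "24.03", "19.04", "29.12.2022", "18.12."])
end_raw = pd.Series(["12.01.2022", "30.03.2023", "04.05.2022", "11.01.2023", "12.01.2023"])

end_dt = pd.to_datetime(end_raw, format="%d.%m.%Y")

# drop a trailing dot, then expand 2-digit years to 20xx (assumption: 21st century)
start = start_raw.str.rstrip(".")
start = start.str.replace(r"^(\d+\.\d+)\.(\d{2})$", r"\1.20\2", regex=True)

# where the year is missing entirely, append the end date's year
no_year = start.str.count(r"\.") == 1
start = start.where(~no_year, start + "." + end_dt.dt.year.astype(str))
start_dt = pd.to_datetime(start, format="%d.%m.%Y")

# a borrowed year that puts the start after the end means the range crosses New Year
rollover = no_year & (start_dt > end_dt)
start_dt = start_dt.where(~rollover, start_dt - pd.DateOffset(years=1))
```

Comparing full dates rather than only the months is a design choice of this sketch; it treats a range such as 18.12. - 12.01.2023 as starting in 2022.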


5 Comments

sdave Over a year ago

Thank you, but how do we convert it into date format, without year information?

Timeless Over a year ago

For example for the value 24.03 what's the format of date you're looking for ? Can you tell the exact value ?

sdave Over a year ago

we have multiple formats available - 31.08. , 30.12.21, 29.12.2022, 24.03 in start date

Timeless Over a year ago

In your question, you gave 4 lines/examples. Can you add the expected output ? This way, it'll be easy ;)

sdave Over a year ago

oh yes, I have added expected output :)
