I have a data frame with a column Campaign whose values follow the format campaign name (start date - end date). I need to create three new columns by extracting the start and end dates:

start_date, end_date, days_between_start_and_end_date

The issue is that the Campaign values are not in a fixed format. My code works well for values like the one below.

1. Season1 hero (18.02. -24.03.2021)

In my code snippet I extract the start date and end date from the Campaign column. As you can see, the start date doesn't have a year, so I add it by comparing the start month with the end month.

import pandas as pd
import re
import datetime

# read csv file
df = pd.read_csv("report.csv")

# extract start and end dates from the 'Campaign' column
dates = df['Campaign'].str.extract(r'(\d+\.\d+)\.\s*-\s*(\d+\.\d+\.\d+)')
df['start_date'] = dates[0]
df['end_date'] = dates[1]

# convert start and end dates to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m')
df['end_date'] = pd.to_datetime(df['end_date'], format='%d.%m.%Y')

# Add year to start date
for index, row in df.iterrows():
    if pd.isna(row["start_date"]) or pd.isna(row["end_date"]):
        continue
    start_month = row["start_date"].month
    end_month = row["end_date"].month
    year = row["end_date"].year
    if start_month > end_month:
        year = year - 1
    dates_str = str(row["start_date"].strftime("%d.%m")) + "." + str(year)
    df.at[index, "start_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
    dates_str = str(row["end_date"].strftime("%d.%m")) + "." + str(row["end_date"].year)
    df.at[index, "end_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")

However, there are many other column values where my regex fails and I get NaN values, for example:

1.  Sales is on (30.12.21-12.01.2022)
2.  Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)
3.  M SALE (19.04 - 04.05.2022)
4.  NEW SALE (29.12.2022-11.01.2023)
5.  Year End (18.12. - 12.01.2023)
6.  XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)

In all of the above examples, the date format is different.

Expected output:

start_date   end_date
2021-12-30   2022-01-12
2023-03-24   2023-03-30
2022-04-19   2022-05-04
2022-12-29   2023-01-11
2022-12-18   2023-01-12
2021-11-18   2021-12-08

Can someone please help me here?

2 Answers

Since the dates in the data don't have a fixed format (some are dd.mm.yy, some are dd.mm.YYYY), it is probably better to apply a custom parser function that uses try-except. We could instead do two conversions with pd.to_datetime and pick values with np.where, but that is unlikely to save time given the amount of string manipulation needed beforehand.
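
For comparison, the two-conversion alternative mentioned above could look roughly like the sketch below. It is only an illustration on a hypothetical Series of already-extracted date strings; it combines the two passes with fillna rather than np.where, and it assumes every value already carries a year.

```python
import pandas as pd

# Hypothetical sample of extracted date strings (not taken from report.csv)
raw = pd.Series(["30.12.21", "29.12.2022", "12.01.2022"])

# Pass 1: four-digit years; values that don't fit become NaT instead of raising
four_digit = pd.to_datetime(raw, format="%d.%m.%Y", errors="coerce")
# Pass 2: two-digit years
two_digit = pd.to_datetime(raw, format="%d.%m.%y", errors="coerce")

# Prefer the four-digit result and fall back to the two-digit one
parsed = four_digit.fillna(two_digit)
print(parsed)
```

Start dates without a year would still have to be completed before either pass, which is the string manipulation referred to above.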

To append the missing years for some rows, we would need several pandas string methods (str.count(), str.cat(), etc.), which are not optimized, so it's probably better to use Python string methods in a loop instead.

Also, iterrows() is very slow because it builds a Series for every row; a plain Python loop over the column values is much faster.
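
As a rough illustration of the loop style meant here, the sketch below walks over plain Python lists taken from two columns rather than over iterrows(). The column names start_raw, end_raw and start_full and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical frame holding already-extracted date strings
df = pd.DataFrame({
    "start_raw": ["18.12.", "24.03", "30.12.21"],
    "end_raw": ["12.01.2023", "30.03.2023", "12.01.2022"],
})

completed = []
# zip over column lists: each iteration sees two plain str objects,
# so no per-row Series is built as it would be with iterrows()
for start, end in zip(df["start_raw"].tolist(), df["end_raw"].tolist()):
    parts = start.rstrip(".").split(".")
    if len(parts) == 2:  # year missing: borrow it from the end date
        end_month, end_year = end.split(".")[1:]
        # step back one year when the start month is later than the end month
        year = int(end_year) - (int(parts[1]) > int(end_month))
        start = f"{parts[0]}.{parts[1]}.{year}"
    completed.append(start)

df["start_full"] = completed
print(df)
```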

Since each string has to be parsed individually anyway, we can use datetime.strptime from the built-in datetime module to perform the conversions.

from datetime import datetime

def datetime_parser(date, end_date=None):
    # remove spaces around the date
    date = date.strip()
    dmy = date.split('.')
    # if the start date has no year ('dd.mm' or 'dd.mm.'), derive it from the end date
    if end_date and (len(dmy) == 2 or not dmy[-1]):
        edmy = end_date.strip().split('.')
        year = int(edmy[-1])
        # a start month later than the end month means the campaign started the year before
        if int(dmy[1]) > int(edmy[1]):
            year -= 1
        date = f"{dmy[0]}.{dmy[1]}.{year:02d}"
    try:
        # try 'dd.mm.YYYY' format (e.g. 29.12.2022) first
        return datetime.strptime(date, '%d.%m.%Y')
    except ValueError:
        # try 'dd.mm.yy' format (e.g. 30.12.21) if the above doesn't work out
        return datetime.strptime(date, '%d.%m.%y')

# extract dates into 2 columns (tentatively start and end dates)
splits = df['Campaign'].str.extract(r"\((.*?)-(.*?)\)").values.tolist()
# parse the dates
df[['start_date', 'end_date']] = [[datetime_parser(start, end), datetime_parser(end)] for start, end in splits]
# find difference
df['days_between_start_and_end_date'] = df['end_date'] - df['start_date']
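
To sanity-check the idea against the campaign names from the question, a self-contained variant is sketched below. It is not the code above: it swaps in a single regular expression with an optional year group (an assumption about which formats can occur), it treats two-digit years as 20xx, and the names DATE_RANGE and parse_campaign are made up for this example.

```python
import re
from datetime import date

# "dd.mm" with an optional trailing dot and an optional 2- or 4-digit year,
# then a hyphen, then a full "dd.mm.yy(yy)" end date, all inside parentheses
DATE_RANGE = re.compile(
    r"\((\d{1,2})\.(\d{1,2})\.?(\d{2,4})?\s*-\s*"
    r"(\d{1,2})\.(\d{1,2})\.(\d{2,4})\)"
)

def _full_year(text):
    # assumption: two-digit years mean 20xx
    year = int(text)
    return year + 2000 if year < 100 else year

def parse_campaign(name):
    """Return (start, end, days) or None if no date range is found."""
    m = DATE_RANGE.search(name)
    if m is None:
        return None
    sd, sm, sy, ed, em, ey = m.groups()
    end = date(_full_year(ey), int(em), int(ed))
    if sy is not None:
        start_year = _full_year(sy)
    else:
        # no year on the start date: same year as the end date, unless the
        # start month is later than the end month (campaign spans New Year)
        start_year = end.year - (int(sm) > end.month)
    start = date(start_year, int(sm), int(sd))
    return start, end, (end - start).days

campaigns = [
    "Sales is on (30.12.21-12.01.2022)",
    "Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)",
    "M SALE (19.04 - 04.05.2022)",
    "NEW SALE (29.12.2022-11.01.2023)",
    "Year End (18.12. - 12.01.2023)",
    "XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)",
]
for name in campaigns:
    print(parse_campaign(name))
```

Note that in the pandas version above, end_date - start_date yields Timedelta values; if a plain number of days is wanted, the .dt.days accessor on that column gives integers.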


1 Comment

@sdave Edited the code again. It covers the additional nuances. Made the regex matching lazy and added a month conditional check.
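
The effect of the lazy match mentioned in this comment can be seen on the campaign name from the question that has a second pair of parentheses. The snippet is only a sketch using the standard re module rather than Series.str.extract.

```python
import re

name = "XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)"

# Greedy: the second .* runs on to the last ")" in the string
greedy = re.search(r"\((.*)-(.*)\)", name).groups()
# Lazy: .*? stops at the first "-" and the first ")" that allow a match
lazy = re.search(r"\((.*?)-(.*?)\)", name).groups()

print(greedy)
print(lazy)
```

With the greedy pattern the second group swallows the trailing "(gifting communities" text, which strptime would then reject.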

I would do a basic regex with extract and then perform slicing:

ser = df["Campaign"].str.extract(r"\((.*?)\)", expand=False)

# assumes the end date is always written in full (dd.mm.YYYY), i.e. the last 10 characters
end_date = ser.str.strip().str[-10:]
# or ser.str.strip().str.rsplit("-").str[-1]

# the start date is everything before the hyphen
start_date = ser.str.strip().str.split(r"\s*-\s*").str[0]

NB: You can assign the Series start_date and end_date to create your two new columns.

Output:

start_date, end_date
(1.0      30.12.21            # <- start_date
 2.0         24.03
 3.0         19.04
 4.0    29.12.2022
 Name: Campaign, dtype: object,
 1.0    12.01.2022            # <- end_date
 2.0    30.03.2023
 3.0    04.05.2022
 4.0    11.01.2023
 Name: Campaign, dtype: object)

5 Comments

Thank you, but how do we convert it into a date format without the year information?
For example, for the value 24.03, what date format are you looking for? Can you give the exact value?
We have multiple formats in the start date: 31.08., 30.12.21, 29.12.2022, 24.03
In your question, you gave 4 lines/examples. Can you add the expected output? This way, it'll be easy ;)
Oh yes, I have added the expected output :)
