I'm a newbie, and I'm sure this is something silly in my code. In my defense, I re-read the Python re documentation and searched around before asking, but I haven't found a duplicate question so far (which surprised me).

Outside of a DataFrame, my regex works, as in this example:

import re

x = "my best friend's birthday is 24 Jan 2001."
print(re.findall(r'\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d{2,4}', x))
<Anaconda console returns:> ['24 Jan 2001']

But in my DataFrame (df1) I have the following:

index     text
0         My birthday is 2/21/19
1         Your birthday is 4/1/20
2         my best friend's birthday is 24 Jan 2001.   

When I run the following code:

df1['dates'] = df1['text'].str.extract(r'.*?(\d+[/-]\d+[/-]?\d*).*?|\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d+')
print(df1['dates'])

I get the following results:

     dates
0    2/21/19
1    4/1/20
2    NaN

I've tried playing around with the parentheses, rereading the documentation, and some other tweaks that just resulted in endless errors. I'm sure it's an obvious mistake, but I don't see it. Can someone help? Thank you.

1 Answer

You need a capture group when using .extract() in pandas. Your capture group before the OR, |, finds the dates with slashes. But after the OR you only have a non-capturing group, so when that branch matches, nothing is captured and the result is NaN.
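To see this outside of pandas, here is a minimal sketch using the re module directly on the question's pattern. It assumes that .extract() reports the captured groups of the first match, much as re.search would; the variable names are illustrative only.

```python
import re

# The question's pattern: only the slash/dash branch (before the |)
# contains a capture group; the month-name branch captures nothing.
pattern = re.compile(
    r'.*?(\d+[/-]\d+[/-]?\d*).*?'
    r'|\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d+'
)

slash_match = pattern.search('My birthday is 2/21/19')
month_match = pattern.search("my best friend's birthday is 24 Jan 2001.")

print(slash_match.group(1))  # the captured date from the first branch
print(month_match.group(0))  # the whole match: the regex itself did match
print(month_match.group(1))  # None: group 1 took no part in this match
```

The third line of output is the important one: the month-name date is matched, but group 1 is empty for it, which is what shows up as NaN in the DataFrame.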

Here I have placed a capture group around the entire search pattern, and each side of the OR is wrapped in its own non-capturing group.

import pandas as pd

df = pd.DataFrame({'text': ['My birthday is 2/21/19', 
    'Your birthday is 4/1/20', 
    'my best friend\'s birthday is 24 Jan 2001.']})

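# (?:...) is non-capturing, so only the outer parentheses capture.
# With a single capture group, expand=False returns a Series.
df.text.str.extract(
    r'((?:\d+[/-]\d+[/-]?\d*)|' + 
    r'(?:\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d+))', 
    expand=False)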

# returns:
0        2/21/19
1         4/1/20
2    24 Jan 2001
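
As a variation, the single capture group can be given a name, in which case .extract() uses that name as the column label of the DataFrame it returns. This is only a sketch of the same idea: the group name dates and the MONTHS helper are my own choices for illustration, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'My birthday is 2/21/19',
    'Your birthday is 4/1/20',
    "my best friend's birthday is 24 Jan 2001.",
]})

# Hypothetical helper: the month alternation, kept separate for readability.
MONTHS = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'

# One named capture group wraps both alternatives, so whichever
# branch matches, its text lands in the same group.
pattern = (
    r'(?P<dates>\d+[/-]\d+[/-]?\d*'
    r'|\d{1,2}\s(?:' + MONTHS + r')[a-z]*\s\d+)'
)

# With the default expand=True, the result is a DataFrame whose
# column is named after the group.
extracted = df['text'].str.extract(pattern)
df['dates'] = extracted['dates']
print(df['dates'].tolist())
```

Naming the group keeps the assignment explicit about which column is being pulled out, which may help if more groups are added later.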

1 Comment

James, I added one closing parenthesis to your code in the first re statement in the extract to get this to work as expected. Your answer helped me tremendously, thank you: r'((:?(\d+[/-]\d+[/-]?\d*))|' +
