1

I have a dataframe containing a string column:

text = ['dasndljaksdhfinwejfiv 12/05/2018'', 'akdhaouiyfbh adv12.03.2019','faytdvi advonalsdnfoaidv 5/9/2019IUAYFNVVKNVAIUEHF']

I want to extract just the dates from this and add to a new column.

I have tried the following but it just returns the original string value of the column:

df['date'] = df.text.replace({r"\d{2}[\/ ](\d{2}|January|Jan)[\/ ]\d{2,4}"},regex=True)

What's the best way to get around this?

2
  • You probably wanted df['date'] = df['text'].str.extract(r"(\d{2}[/. ](?:\d{2}|January|Jan)[/. ]\d{2}(?:\d{2})?)") Commented Feb 13, 2019 at 23:40
  • I added a more appropriate pattern below. Commented Feb 13, 2019 at 23:52

1 Answer 1

3

You may use .str.extract(), use \d{1,2} to match 1-digit days and months and add the . to the delimiters pattern:

df['date'] = df['text'].str.extract(r"(\d{1,2}[/. ](?:\d{1,2}|January|Jan)[/. ]\d{2}(?:\d{2})?)")

See the regex demo.

Also, note that \d{2}(?:\d{2})? matches 2 or 4 digits, and \d{2,4} matches 2, 3 or 4 digits.

Note that the whole regex pattern is wrapped with a capturing group, it is necessary because .str.extract requires at least one capturing group to work, it will yield just the text captured.

To match the dates not inside other digits, you may add (?<!\d) at the start and (?!\d) at the end, it may make the pattern safer.

If you plan to match any English month name you need to expand the pattern a little:

r"(?<!\d)(\d{1,2}[/. ](?:\d{1,2}|(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:t(?:ember)?)?|(?:Nov|Dec)(?:ember)?|Oct(?:ober)?))[/. ]\d{2}(?:\d{2})?)(?!\d)"

See another regex demo.

Details

  • (?<!\d) - no digit immediately to the left is allowed
  • ( - start of a capturing group
  • \d{1,2} - 1 or 2 digits
  • [/. ] - space, / or .
  • (?:\d{1,2}|(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:t(?:ember)?)?|(?:Nov|Dec)(?:ember)?|Oct(?:ober)?)) - month name, abbreviated or not
  • [/. ] - space, / or .
  • \d{2}(?:\d{2})? - 2 digits followed with an optional sequence of two digits
  • ) - end of the capturing group
  • (?!\d) - no digit immediately to the right is allowed
Sign up to request clarification or add additional context in comments.

2 Comments

How would this be modified to allow for 2 digit dates within the text as well? Such as 02/13, 2/13 etc
@user10 Add an optional group, see this demo, (?<!\d)(\d{1,2}[/. ](?:(?:\d{1,2}|(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:t(?:ember)?)?|(?:Nov|Dec)(?:ember)?|Oct(?:ober)?))[/. ])?\d{2}(?:\d{2})?)(?!\d)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.