0

I want to extract dates in the following format from a dataframe:

Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009

I have written the below code to extract it:

d4=df.str.extractall(r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z][?:]*)((?:\d{1,2}(?:th|st|nd|rd)[,?:])\d{4})')

Unfortunately, it does not extract anything.

2
  • Expected result! Commented Feb 9, 2018 at 1:44
  • Mar 20th, 2009,Mar 21st, 2009,Mar 22nd, 2009 Commented Feb 9, 2018 at 1:46

4 Answers

2

I assume that your date format is always MMM DDst/nd/rd/th, YYYY with a two-digit day, i.e. Mar 01st, 2009 rather than Mar 1st, 2009. Under that assumption the following regex should work. Note that it ties each final digit to a single suffix, so it does not match 11th, 12th, or 13th:

\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:[0-3][1]st|[0-2][2]nd|[0-2][3]rd|[1-3][0]th|[0-2][4-9]th), \d{4}
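As a quick check, the pattern from this answer can be run against the sample string from the question. This is only a sketch: the names STRICT_DATE, sample, and teens are illustrative, and the second call is there to show the limitation that 11th is skipped while a zero-padded 01st is accepted.

```python
import re

# Pattern from this answer: two-digit day, suffix chosen by the last digit.
STRICT_DATE = re.compile(
    r'\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) '
    r'(?:[0-3][1]st|[0-2][2]nd|[0-2][3]rd|[1-3][0]th|[0-2][4-9]th), \d{4}'
)

sample = 'Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009'
matches = STRICT_DATE.findall(sample)
print(matches)

# "11th" pairs the digit 1 with "th", which this pattern does not allow,
# so only the zero-padded "01st" date is found here.
teens = STRICT_DATE.findall('Mar 11th, 2009; Mar 01st, 2009')
print(teens)
```

If days such as 11th to 13th can occur in the data, a looser day pattern like \d{1,2}(?:th|st|nd|rd) may be the simpler choice.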


1

I saw multiple problems/doubts with your pattern, so I just rewrote it from the start as this:

(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}(?:th|st|nd|rd),\s+\d{4}

Here is an explanation of the pattern:

(?:Jan|Feb|...|Dec)    match, but do not capture, the abbreviated month name
\s+                    one or more whitespace characters
\d{1,2}                day as one or two digits
(?:th|st|nd|rd)        match, but do not capture, the ordinal suffix
,                      a literal comma
\s+                    one or more whitespace characters
\d{4}                  match a four digit year

Full code:

import re

my_str = 'Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009'

match = re.findall(r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}(?:th|st|nd|rd),\s+\d{4}', my_str)

for item in match:
    print(item)
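Since the question is about a dataframe, here is a minimal sketch of the same pattern used with pandas. It assumes the text sits in a single column; the column name "text" and the sample rows are made up for illustration. The .str accessor belongs to a Series, so it is applied to one column rather than to the whole DataFrame, and extractall needs at least one capture group, hence the outer parentheses.

```python
import pandas as pd

# One capture group around the whole date, because extractall requires
# at least one capture group in the pattern.
DATE_PATTERN = (
    r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
    r'\s+\d{1,2}(?:th|st|nd|rd),\s+\d{4})'
)

# Hypothetical data: the column name "text" is an assumption.
df = pd.DataFrame({
    'text': [
        'Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009',
        'no date in this row',
    ]
})

# .str is available on a Series (one column), not on the DataFrame itself.
dates = df['text'].str.extractall(DATE_PATTERN)
print(dates)

# Column 0 holds the text captured by the single group.
date_list = dates[0].tolist()
print(date_list)
```

The result is indexed by the original row and the match number within that row; rows without a match simply do not appear.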


0

Your pattern needs to allow for whitespace between the month and the day, and between the comma and the year:

((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\s+((?:\d{1,2}(?:th|st|nd|rd)[,?:])\s+\d{4})

 (                             # (1 start)
      (?:
           Jan
        |  Feb
        |  Mar
        |  Apr
        |  May
        |  Jun
        |  Jul
        |  Aug
        |  Sep
        |  Oct
        |  Nov
        |  Dec
      )
 )                             # (1 end)
 \s+ 
 (                             # (2 start)
      (?:
           \d{1,2} 
           (?: th | st | nd | rd )
           [,?:] 
      )
      \s+ 
      \d{4} 
 )                             # (2 end)
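The expanded layout above can be used almost as-is in Python by compiling with re.VERBOSE, which ignores whitespace and # comments inside the pattern. The following is a sketch under that assumption; DATE_PARTS and pairs are illustrative names. With two capture groups, findall returns one (month, rest) tuple per match.

```python
import re

DATE_PARTS = re.compile(r'''
    ( (?: Jan | Feb | Mar | Apr | May | Jun
        | Jul | Aug | Sep | Oct | Nov | Dec ) )   # group 1: month
    \s+
    ( \d{1,2} (?: th | st | nd | rd ) [,?:]       # group 2: day, suffix, punctuation
      \s+ \d{4} )                                 # group 2 (continued): year
''', re.VERBOSE)

sample = 'Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009'

# Each match yields a tuple with the contents of group 1 and group 2.
pairs = DATE_PARTS.findall(sample)
for month, rest in pairs:
    print(month, '|', rest)
```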


0

You can use re.split.

Regex: ;\s

Details:

  • ; Matches a literal semicolon
  • \s Matches any whitespace character

Python code:

import re

def Split(text):
    # Split on a semicolon followed by one whitespace character.
    return re.split(r';\s', text)

print(Split("Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009"))

Output:

['Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009']
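If the input may end with a trailing semicolon, splitting on ;\s leaves that semicolon attached to the last item, because nothing follows it. One possible variation, sketched below, splits on ;\s* and drops empty pieces. The helper name split_dates is hypothetical and not part of the answer above.

```python
import re

def split_dates(text):
    # split_dates is a hypothetical helper name used only for this sketch.
    # ';\s*' also matches a semicolon with no whitespace after it,
    # which produces an empty trailing piece that is filtered out below.
    parts = re.split(r';\s*', text)
    return [p.strip() for p in parts if p.strip()]

parts = split_dates('Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009;')
print(parts)
```

Note that splitting only works when the text contains nothing but dates; for dates embedded in other text, a matching pattern is the safer route.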

