Extract pattern using Regex in Python

Question

I want to extract dates in the following format from a DataFrame:

Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009

I wrote the following code to do this:

d4=df.str.extractall(r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z][?:]*)((?:\d{1,2}(?:th|st|nd|rd)[,?:])\d{4})')

Unfortunately, it does not extract anything.

Expected result!

– Srdjan M., Commented Feb 9, 2018 at 1:44
Mar 20th, 2009,Mar 21st, 2009,Mar 22nd, 2009

– user15051990, Commented Feb 9, 2018 at 1:46

Yung · Accepted Answer · 2018-02-09 02:34:08Z

2

I assume your dates always use the format MMM DDst/nd/rd/th, YYYY with a zero-padded two-digit day, i.e. Mar 01st, 2009 rather than Mar 1st, 2009. Under that assumption, the following regex should work:

\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:[0-3][1]st|[0-2][2]nd|[0-2][3]rd|[1-3][0]th|[0-2][4-9]th), \d{4}
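To see what this stricter pattern accepts, here is a minimal sketch using only the standard `re` module. The sample string and the name `STRICT_DATE` are illustrative assumptions, not part of the answer. One caveat it shows: as written, the pattern pairs each suffix with the last digit only, so it rejects 11th, 12th and 13th (and would accept 11st, 12nd, 13rd).

```python
import re

# Pattern quoted from the answer above (zero-padded, two-digit days only).
STRICT_DATE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"(?:[0-3][1]st|[0-2][2]nd|[0-2][3]rd|[1-3][0]th|[0-2][4-9]th), \d{4}"
)

# Illustrative input: the three dates from the question plus two edge cases.
text = "Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Mar 1st, 2009; Mar 11th, 2009"

matches = STRICT_DATE.findall(text)
print(matches)  # ['Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009']

# Edge cases: a single-digit day and the 11th are both skipped.
print(STRICT_DATE.search("Mar 1st, 2009"))   # None
print(STRICT_DATE.search("Mar 11th, 2009"))  # None
```

If days such as 11th to 13th or unpadded days can occur in the data, a looser day part like `\d{1,2}(?:th|st|nd|rd)` may be the safer choice.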


Tim Biegeleisen · Accepted Answer · 2018-02-09 01:46:41Z

I saw multiple problems/doubts with your pattern, so I just rewrote it from the start as this:

(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}(?:th|st|nd|rd),\s+\d{4}

Here is an explanation of the pattern:

(?:Jan|Feb|...|Dec)    match, but do not capture, the abbreviated month name
\s+                    one or more whitespace characters
\d{1,2}                day as one or two digits
(?:th|st|nd|rd)        match, but do not capture, the ordinal suffix
,                      a literal comma
\s+                    one or more whitespace characters
\d{4}                  match a four digit year

Full code:

import re

my_str = 'Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009'

# The pattern has no capture groups, so findall returns each full match as a string
matches = re.findall(r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}(?:th|st|nd|rd),\s+\d{4}', my_str)

for item in matches:
    print(item)
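For context on why the pattern from the question finds nothing, the sketch below runs it unchanged against the sample string. Two things stand out: `[a-z]` demands a lowercase letter immediately after the month abbreviation, and nothing in the pattern allows whitespace. The contrived string `"Marx20th,2009"` is a made-up input, used only to show what the original pattern does accept.

```python
import re

text = "Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009"

# Pattern from the question, unchanged.
original = (
    r"((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z][?:]*)"
    r"((?:\d{1,2}(?:th|st|nd|rd)[,?:])\d{4})"
)

# "Mar" is followed by a space, but [a-z] requires a letter, so nothing matches.
original_matches = re.findall(original, text)
print(original_matches)  # []

# A made-up string with a letter after the month and no spaces does match.
contrived = re.findall(original, "Marx20th,2009")
print(contrived)  # [('Marx', '20th,2009')]
```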


user557597 · Accepted Answer · 2018-02-09 01:47:18Z

0

The pattern needs to allow for whitespace between the month, the day, and the year.

((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\s+((?:\d{1,2}(?:th|st|nd|rd)[,?:])\s+\d{4})

 (                             # (1 start)
      (?:
           Jan
        |  Feb
        |  Mar
        |  Apr
        |  May
        |  Jun
        |  Jul
        |  Aug
        |  Sep
        |  Oct
        |  Nov
        |  Dec
      )
 )                             # (1 end)
 \s+ 
 (                             # (2 start)
      (?:
           \d{1,2} 
           (?: th | st | nd | rd )
           [,?:] 
      )
      \s+ 
      \d{4} 
 )                             # (2 end)
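The expanded layout above can be used more or less directly in Python through the `re.VERBOSE` flag, which ignores unescaped whitespace and `#` comments inside the pattern. The following is a minimal sketch, not the answerer's own code; the name `DATE_PARTS` is an assumption, and the redundant inner non-capturing group around the day is dropped. Note that `[,?:]` is a character class matching a single `,`, `?` or `:`.

```python
import re

DATE_PARTS = re.compile(
    r"""
    ( (?: Jan | Feb | Mar | Apr | May | Jun
        | Jul | Aug | Sep | Oct | Nov | Dec ) )   # group 1: month
    \s+
    ( \d{1,2} (?: th | st | nd | rd ) [,?:]       # group 2: day, suffix, punctuation
      \s+ \d{4} )                                 #          followed by the year
    """,
    re.VERBOSE,
)

text = "Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009"

# With two capture groups, findall returns one (month, rest) tuple per match.
pairs = DATE_PARTS.findall(text)
print(pairs)
# [('Mar', '20th, 2009'), ('Mar', '21st, 2009'), ('Mar', '22nd, 2009')]
```

Because this version keeps capture groups, it is also the shape that pandas' `Series.str.extractall` expects, since that method requires at least one capture group in the pattern.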


Srdjan M. · Accepted Answer · 2018-02-09 02:13:14Z

0

You can use re.split.

Regex: ;\s

Details:

; matches a literal semicolon

\s matches any whitespace character

Python code:

import re

def split_dates(text):
    # Split on a semicolon followed by one whitespace character
    return re.split(r';\s', text)

print(split_dates("Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009"))

Output:

['Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009']
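Splitting relies on the delimiter rather than on the shape of a date, so any other text between semicolons would come through as well. As a hedged sketch of one way to combine both ideas, the hypothetical helper `split_and_validate` below splits on `;` with optional whitespace and keeps only pieces that fully match the date pattern from the earlier answer. The helper name and the sample input are assumptions for illustration.

```python
import re

# Date pattern borrowed from the answer further up in this thread.
DATE = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"\s+\d{1,2}(?:th|st|nd|rd),\s+\d{4}"
)

def split_and_validate(text):
    """Split on ';' plus optional whitespace; keep only date-shaped pieces."""
    pieces = (p.strip() for p in re.split(r";\s*", text))
    return [p for p in pieces if DATE.fullmatch(p)]

# Illustrative input with a non-date piece and a trailing semicolon.
result = split_and_validate("Mar 20th, 2009; not a date;Mar 22nd, 2009;")
print(result)  # ['Mar 20th, 2009', 'Mar 22nd, 2009']
```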
