How to match multiple separator tokens with a regular expression

Question

To extract the first three letters 'abc' and the three sets of three-digit numbers in 000_111_222, I am using the following expression:

import re

text = 'abc_000_111_222'
print(re.findall('^[a-z]{3}_[0-9]{3}_[0-9]{3}_[0-9]{3}', text))

But the expression returns an empty list when minuses or periods are used instead of underscores: abc.000.111.222 or abc-000-111-222, or any combination such as abc_000.111-222.

Sure, I could use a simple replace to unify the text variable: text = text.replace('-', '_').replace('.', '_')

But I wonder whether, instead of replacing, I could modify the regex so that it recognizes underscores, minuses, and periods.

Replace the underscore with something that matches any of the three characters?
– jonrsharpe, Commented Oct 13, 2016 at 19:39
A few more sample strings would shed more light on what you are trying to achieve.
– agg3l, Commented Oct 13, 2016 at 19:42
print(re.findall(r'^[a-z]{3}(?:[_.-]\d{3}){3}$', text)) should work
– anubhava, Commented Oct 13, 2016 at 19:46
@anubhava, the pattern can be repeated only twice since the string doesn't end with _
– Federico Piazza, Commented Oct 13, 2016 at 19:51

Federico Piazza · Accepted Answer · 2016-10-13 19:47:46Z

3

You can use a regex character class, written [...]. For your case it can be [_.-]. Note that the hyphen is placed at the end: between two other characters it would be interpreted as a range, as in [a-z].
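To make the point about hyphen placement concrete, here is a small sketch. The sample strings and variable names are made up for illustration; only the [_.-] class comes from the answer.

```python
import re

# Hyphen last (or first, or escaped): treated as a literal character.
literal_hyphen = re.findall(r'[_.-]', 'a_b.c-d')
print(literal_hyphen)  # ['_', '.', '-']

# Hyphen between two characters: a range, here from '.' to '_' in ASCII order,
# which also covers digits and uppercase letters.
range_hyphen = re.findall(r'[.-_]', 'a1B')
print(range_hyphen)  # ['1', 'B']

# A reversed range ('_' sorts after '.') is rejected when the pattern is compiled.
try:
    re.compile(r'[_-.]')
    reversed_range_ok = True
except re.error:
    reversed_range_ok = False
print(reversed_range_ok)  # False
```

Writing the class as [-_.] or [_\-.] should behave the same as [_.-], since a leading or escaped hyphen is also taken literally.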

You can use a regex like this:

print(re.findall('^[a-z]{3}[_.-][0-9]{3}[_.-][0-9]{3}[_.-][0-9]{3}', text))
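If the goal is to pull out the letters and the three numbers as separate pieces, one possible sketch is to wrap each piece in a capturing group and read them back with match.groups(). The names PATTERN and extract_parts are hypothetical, chosen for this example only.

```python
import re

# Same pattern as above, with a capturing group around each piece to extract.
PATTERN = re.compile(r'^([a-z]{3})[_.-]([0-9]{3})[_.-]([0-9]{3})[_.-]([0-9]{3})')

def extract_parts(text):
    """Return the four pieces as a tuple, or None if text does not match."""
    m = PATTERN.match(text)
    return m.groups() if m else None

# The four variants mentioned in the question.
for sample in ('abc_000_111_222', 'abc.000.111.222',
               'abc-000-111-222', 'abc_000.111-222'):
    print(sample, extract_parts(sample))
```

Each of the four samples should give ('abc', '000', '111', '222'), including the one with mixed separators.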

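Btw, you can shorten your regex to something like this (the group is non-capturing, since re.findall would otherwise return only the group's contents rather than the whole match):

print(re.findall(r'^[a-z]{3}[_.-](?:\d{3}[_.-]){2}\d{3}', text))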
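One thing to keep in mind when a repeated group is used with re.findall: if the pattern contains a capturing group, findall returns what the group captured instead of the full match. A short sketch of the difference, with variable names made up for illustration:

```python
import re

text = 'abc_000_111_222'

# Capturing group: findall returns what the group captured on its last repetition.
capturing = re.findall(r'^[a-z]{3}[_.-](\d{3}[_.-]){2}\d{3}', text)
print(capturing)  # ['111_']

# Non-capturing group (?:...): findall returns the whole match.
non_capturing = re.findall(r'^[a-z]{3}[_.-](?:\d{3}[_.-]){2}\d{3}', text)
print(non_capturing)  # ['abc_000_111_222']
```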

As a side note, in case you want to require the same separator throughout, you can use a capture group and reference its content with a backreference like this:

^[a-z]{3}([_.-])[0-9]{3}\1[0-9]{3}\1[0-9]{3}
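A minimal sketch of how that backreference pattern behaves, using re.match rather than re.findall (findall would return only the captured separator here). SAME_SEP and uses_one_separator are hypothetical names for this example.

```python
import re

# \1 must repeat exactly the separator captured by the first group.
SAME_SEP = re.compile(r'^[a-z]{3}([_.-])[0-9]{3}\1[0-9]{3}\1[0-9]{3}')

def uses_one_separator(text):
    """True if text matches and all three separators are identical."""
    return SAME_SEP.match(text) is not None

print(uses_one_separator('abc_000_111_222'))  # True
print(uses_one_separator('abc-000-111-222'))  # True
print(uses_one_separator('abc_000.111-222'))  # False

# The separator that was used is available as group 1.
print(SAME_SEP.match('abc.000.111.222').group(1))  # .
```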

edited Oct 13, 2016 at 19:47

answered Oct 13, 2016 at 19:40


2 Comments

alphanumeric Over a year ago

Terrific answer! Thanks!

Federico Piazza Over a year ago

@alphanumeric glad to help

boardrider · Answer · 2016-10-14 20:00:56Z

-1

Why not abandon regexes altogether, and use a clearer and simpler solution?

$ cat /tmp/tmp.py
SEP = '_.,;-=+'

def split_str(text):
    # Split on the first separator from SEP that occurs in text;
    # implicitly returns None when no separator is found.
    for s in SEP:
        res = text.split(s)
        if len(res) > 1:
            return res

print(split_str('abc_000_111_222'))
print(split_str('abc;000;111;222'))
print(split_str('abc.000.111.222'))
print(split_str('abc-000-111-222'))

Which gives:

$ python3 /tmp/tmp.py
['abc', '000', '111', '222']
['abc', '000', '111', '222']
['abc', '000', '111', '222']
['abc', '000', '111', '222']

$
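Note that split_str splits on only the first separator it finds, so a mixed string such as abc_000.111-222 from the question comes back as ['abc', '000.111-222']. One way to handle mixed separators while still avoiding regexes might be to normalize every separator to a single character first. This is a sketch under that assumption, and split_mixed is a hypothetical name, not part of the answer above.

```python
SEP = '_.,;-=+'

# Translation table mapping every separator character to '_'.
TO_UNDERSCORE = str.maketrans(SEP, '_' * len(SEP))

def split_mixed(text):
    """Split text on any character in SEP, even when separators are mixed."""
    return text.translate(TO_UNDERSCORE).split('_')

print(split_mixed('abc_000.111-222'))  # ['abc', '000', '111', '222']
print(split_mixed('abc;000;111;222'))  # ['abc', '000', '111', '222']
```

Unlike the regex approach, this does not check that the pieces are three letters followed by three-digit numbers; it only splits.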

answered Oct 14, 2016 at 20:00

