To match the first three letters 'abc' and the three three-digit groups in 000_111_222, I am using the following expression:

import re

text = 'abc_000_111_222'
print(re.findall(r'^[a-z]{3}_[0-9]{3}_[0-9]{3}_[0-9]{3}', text))

But the expression returns an empty list when minuses or periods are used instead of underscores: abc.000.111.222 or abc-000-111-222, or any combination such as abc_000.111-222.

Sure, I could use a simple replace to unify the text variable: text = text.replace('-', '_').replace('.', '_')

But I wonder whether, instead of replacing, I could modify the regex so that it recognizes underscores, minuses and periods.

  • Replace the underscore with something that matches any of the three characters? Commented Oct 13, 2016 at 19:39
  • few more sample strings will shed more light on what you try to achieve Commented Oct 13, 2016 at 19:42
  • print re.findall(r'^[a-z]{3}(?:[_.-]\d{3}){3}$', text) should work Commented Oct 13, 2016 at 19:46
  • @anubhava, the pattern can be repeated only twice since the string doesn't end with _ Commented Oct 13, 2016 at 19:51
  • No, there is a _ before first number as well. Commented Oct 13, 2016 at 19:53

2 Answers

You can use a regex character class, written [...]. For your case it can be [_.-]. Note the hyphen at the end: a hyphen between two characters inside a class is treated as a range, as in [a-z], so put it first or last (or escape it as \-) to match a literal hyphen.

You can use a regex like this:

print(re.findall(r'^[a-z]{3}[_.-][0-9]{3}[_.-][0-9]{3}[_.-][0-9]{3}', text))
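To see why the position of the hyphen inside the class matters, here is a small sketch (the sample string and variable names are made up for illustration). It compares a trailing hyphen, a hyphen placed between two characters, and an escaped hyphen:

```python
import re

sample = 'a-b.c_d5'

# Hyphen last in the class: '-' is a literal, so only '-', '.' and '_' match.
literal = re.findall(r'[_.-]', sample)

# Hyphen between '.' and '_': read as the range '.' (0x2E) through '_' (0x5F).
# That range covers digits and uppercase letters but not '-' (0x2D) itself.
as_range = re.findall(r'[.-_]', sample)

# Escaped hyphen: a literal again, wherever it sits in the class.
escaped = re.findall(r'[_\-.]', sample)

print(literal)   # ['-', '.', '_']
print(as_range)  # ['.', '_', '5']
print(escaped)   # ['-', '.', '_']
```

Writing the class as [_-.] would not even compile in Python, because the range from '_' down to '.' is reversed.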


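By the way, you can shorten the regex with a repeated group. Make the group non-capturing with (?:...), because re.findall returns only the group's text when the pattern contains a capturing group:

print(re.findall(r'^[a-z]{3}[_.-](?:\d{3}[_.-]){2}\d{3}', text))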

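As a side note, if you want to require the same separator throughout, you can capture the first separator in a group and refer back to it with \1. Because this pattern contains a capturing group, re.findall would return only the separator, so use it with re.match or re.search: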

^[a-z]{3}([_.-])[0-9]{3}\1[0-9]{3}\1[0-9]{3}
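A minimal sketch of how the backreference pattern above behaves, using re.match. The names SAME_SEP and uses_one_separator are hypothetical and chosen only for this example:

```python
import re

# \1 must repeat whatever the first group ([_.-]) matched.
SAME_SEP = re.compile(r'^[a-z]{3}([_.-])[0-9]{3}\1[0-9]{3}\1[0-9]{3}')

def uses_one_separator(text):
    """Return True if text matches and uses one separator throughout."""
    return SAME_SEP.match(text) is not None

print(uses_one_separator('abc_000_111_222'))  # True
print(uses_one_separator('abc.000.111.222'))  # True
print(uses_one_separator('abc_000.111-222'))  # False: separators are mixed
```

If you need the matched text rather than a yes/no answer, SAME_SEP.match(text).group(0) returns the whole match and group(1) returns the separator.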

2 Comments

Terrific answer! Thanks!
@alphanumeric glad to help

Why not abandon regexes altogether, and use a clearer and simpler solution?

$ cat /tmp/tmp.py
SEP = '_.,;-=+'

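def split_str(text):
    # Split on the first separator in SEP that actually occurs in text.
    for s in SEP:
        res = text.split(s)
        if len(res) > 1:
            return res
    # No separator found: return the text as a single-element list.
    return [text]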

print(split_str('abc_000_111_222'))
print(split_str('abc;000;111;222'))
print(split_str('abc.000.111.222'))
print(split_str('abc-000-111-222'))

Which gives:

$ python3 /tmp/tmp.py
['abc', '000', '111', '222']
['abc', '000', '111', '222']
['abc', '000', '111', '222']
['abc', '000', '111', '222']

$
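Note that split_str splits on only one separator per string, so a mixed input such as abc_000.111-222 (which the question asks about) would come back as ['abc', '000.111-222']. One regex-free way to cover that case, sketched here as an assumption rather than as part of the original answer, is to map every separator to a single character with str.translate before splitting. The names TABLE and split_mixed are hypothetical:

```python
SEP = '_.,;-=+'

# Translation table that maps every separator in SEP to '_'
# (choosing '_' as the common separator is an arbitrary assumption).
TABLE = str.maketrans(SEP, '_' * len(SEP))

def split_mixed(text):
    """Split text on any character in SEP, even when separators are mixed."""
    return text.translate(TABLE).split('_')

print(split_mixed('abc_000.111-222'))  # ['abc', '000', '111', '222']
print(split_mixed('abc;000;111;222'))  # ['abc', '000', '111', '222']
```

Unlike the regex approach, this does not check that the pieces are three letters followed by three-digit groups; it only splits.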
