python regex for repeating string

Question

I want to verify and then parse this string (in quotes):

string = "start: c12354, c3456, 34526; other stuff that I don't care about"
# Note that some codes begin with 'c'

I would like to verify that the string starts with 'start:' and that the list of codes ends with ';'. Afterward, I would like to have a regex parse out the individual codes. I tried the following Python re code:

import re

# optional space after each comma so the sample string above matches
regx = r"start: (c?[0-9]+,? ?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())

I have tried different variations, but I can only get either the first or the last code, not a list of all three.

Or should I abandon using a regex?

EDIT: updated to reflect part of the problem space I neglected and fixed string difference. Thanks for all the suggestions - in such a short time.

Indent code 4 spaces or use the "{}" button in the post editor. I fixed it for you. BTW, did you mean "V1 OIDs" or "start"?
– Jim Garrison, Commented Jan 10, 2011 at 21:48

Konrad Rudolph · Accepted Answer · 2011-01-10 21:50:03Z

5

In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
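A minimal sketch of that behaviour, using a shortened version of the question's sample string. The pattern here is an illustrative variant (it tolerates whitespace before each code), not the asker's exact regex:

```python
import re

text = "start: c12354, c3456, 34526; other stuff"

# The inner capturing group is applied three times, once per code,
# but the match object keeps only the final capture of that group.
repeated = re.compile(r"start:(?:\s*(c?[0-9]+),?)+;")
match = repeated.search(text)
last_only = match.groups()
print(last_only)  # ('34526',)
```

The whole list is matched, yet only the last code is retrievable from the group.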

Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single match, using re.findall('c?[0-9]+', text).
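The two steps could be sketched roughly as follows. The names `LIST_RE`, `CODE_RE` and `extract_codes` are hypothetical, and the first pattern is stricter than the answer requires: it is an assumption that the whole list should be validated, not just delimited.

```python
import re

# Step 1 pattern (assumed, stricter than a plain "start:...;" search):
# validates the "start: code, code, ...;" prefix and captures the list.
LIST_RE = re.compile(r"start:\s*((?:c?[0-9]+\s*,\s*)*c?[0-9]+)\s*;")
# Step 2 pattern: a single code, keeping the optional leading 'c'.
CODE_RE = re.compile(r"c?[0-9]+")

def extract_codes(text):
    """Return the list of codes, or None if text does not match the format."""
    match = LIST_RE.match(text)
    if match is None:
        return None
    return CODE_RE.findall(match.group(1))

print(extract_codes("start: c12354, c3456, 34526; other stuff that I don't care about"))
# ['c12354', 'c3456', '34526']
```

Anything after the first ';' is ignored, and a malformed list such as "start: c12, x9;" yields None instead of a partial result.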


2 Comments

madmik3 Over a year ago

Looks right to me. You can use a regex to find the start: and ; and do it as a two-step process. You might also want to check this out: stackoverflow.com/questions/1099178/…

Lars Nordin Over a year ago

Thanks, I wondered about regex groups and repetition for a single regex search() call. I had switched over to using findall() as well but I asked the question here just to see if there was a better way.

Donald Miner · Accepted Answer · 2011-01-10 21:51:19Z

5

You could use the standard string tools, which are pretty much always more readable.

s = "start: c12354, c3456, 34526;"

s.startswith("start:")  # True if s starts with this string
s.endswith(";")  # True if s ends with this string
s[6:-1].strip().split(', ')  # list of tokens separated by ", ": ['c12354', 'c3456', '34526']
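If format verification is also wanted, the same string tools can be combined into one helper. This is only a sketch: `parse_codes` is a hypothetical name, and it assumes that text after the first ';' should be ignored, as in the question's sample string.

```python
def parse_codes(s):
    """Validate and split using only str methods; return None on bad input."""
    if not s.startswith("start:"):
        return None
    # partition splits at the first ';'; sep is '' when no ';' is present.
    body, sep, _rest = s[len("start:"):].partition(";")
    if not sep:
        return None
    codes = [token.strip() for token in body.split(",")]
    # Each code must be digits with an optional leading 'c'.
    for code in codes:
        digits = code[1:] if code.startswith("c") else code
        if not digits.isdigit():
            return None
    return codes

print(parse_codes("start: c12354, c3456, 34526; other stuff"))
# ['c12354', 'c3456', '34526']
```

This is longer than a regex, but each check is explicit and easy to adjust.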


1 Comment

Lars Nordin Over a year ago

Yeah, I know that I could use straight string parsing, but then I would have to code the verification of the string format myself, whereas with a regex you get that right off the bat.

elo80ka · Accepted Answer · 2011-01-12 07:30:54Z

2

This can be done (pretty elegantly) with a tool like Pyparsing:

from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string

code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue

Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.


3 Comments

PaulMcG Over a year ago

Thanks for pitching in with a pyparsing solution! Some other options to consider: define code as Word('c'+string.digits, string.digits); then parser can just be 'start:' + delimitedList(code)("codes") + ';'; the list of codes can be accessed through the results name as codes = result.codes -- in general I would keep the definition of things like code as clean as possible, and not mess them up with things like optional comma delimiters; instead add the commas at the next higher level of parser composition. But your parser certainly gets the job done - congrats!

elo80ka Over a year ago

@Paul: Nice! Didn't know about delimitedList before now, and it totally makes sense that Literal be optional. Great stuff...thanks!

Lars Nordin Over a year ago

Interesting. I will have to look into pyparsing. Thanks for the post.

Hugh Bothwell · Accepted Answer · 2011-01-11 00:53:24Z

0

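import re

# Matches the "start: ...;" prefix and captures everything up to the first ';'.
sstr = re.compile(r'start:([^;]*);')
# Captures the digits of each code, skipping an optional leading 'c'.
slst = re.compile(r'c?(\d+)')

mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = sstr.match(mystr)
if match:
    res = slst.findall(match.group(1))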

results in

['12354', '3456', '34526']


1 Comment

Lars Nordin Over a year ago

Thanks for coding out the answer that madmik3 suggested - very helpful.
