2

I want to verify and then parse this string (in quotes):

string = "start: c12354, c3456, 34526; other stuff that I don't care about"
# Note that some codes begin with 'c'

I would like to verify that the string starts with 'start:' and ends with ';'. Afterward, I would like to have a regex parse out the codes. I tried the following Python re code:

import re

regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print('matched.groups()', matched.groups())

I have tried different variations, but I can get only the first or the last code, not a list of all three.

Or should I abandon using a regex?

EDIT: updated to reflect part of the problem space I neglected and fixed string difference. Thanks for all the suggestions - in such a short time.

1
  • Indent code 4 spaces or use the "{}" button in the post editor. I fixed it for you. BTW, did you mean "V1 OIDs" or "start"? Commented Jan 10, 2011 at 21:48

4 Answers

5

In Python, this isn’t possible with a single regular expression: each capture of a group overwrites the previous capture of that same group, so only the last repetition is kept (in .NET, this would actually be possible, since the engine distinguishes between captures and groups).
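
To see the overwriting behaviour, here is a small sketch. The sample string is an assumption: it is the question's string with the spaces after the commas removed, so that the question's pattern actually matches.

```python
import re

# Assumed sample: the question's string without spaces after the commas.
text = "start: c12354,c3456,34526; other stuff"

m = re.search(r"start: (c?[0-9]+,?)+;", text)

# The group matched three times, but only the final repetition is kept.
print(m.group(1))   # 34526
print(m.groups())   # ('34526',)
```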

Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single match, via re.findall('c?[0-9]+', text).
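
A minimal sketch of that two-step approach might look like this. The helper name extract_codes and the exact outer pattern are illustrative assumptions, not part of the original answer.

```python
import re

def extract_codes(text):
    """Hypothetical helper: validate the 'start: ... ;' frame, then list the codes."""
    # Step 1: check the frame and capture everything between 'start:' and ';'.
    m = re.match(r"start:([^;]*);", text)
    if m is None:
        return None  # the string does not have the expected shape
    # Step 2: findall returns every code, not just the last capture.
    return re.findall(r"c?[0-9]+", m.group(1))

codes = extract_codes("start: c12354, c3456, 34526; other stuff that I don't care about")
print(codes)  # ['c12354', 'c3456', '34526']
```

Returning None for a non-matching string keeps the validation and the parsing in one call; raising an exception would work just as well if that suits the caller better.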


2 Comments

Looks right to me. You can use a regex to find the start: and ; and then do it as a two-step process. You might also want to check this out: stackoverflow.com/questions/1099178/…
Thanks, I wondered about regex groups and repetition for a single regex search() call. I had switched over to using findall() as well but I asked the question here just to see if there was a better way.
5

You could use the standard string tools, which are pretty much always more readable.

s = "start: c12354, c3456, 34526;"

s.startswith("start:")  # True if s starts with this string

s.endswith(";")  # True if s ends with this string

s[6:-1].strip().split(', ')  # list of tokens separated by ", "; strip() drops the space after "start:"
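
If you also need to validate the format and tolerate trailing text after the ';' (as in the question's string), the pieces above could be combined roughly like this. The function name parse_codes and the exact validation rules are assumptions made for illustration.

```python
def parse_codes(line):
    """Hypothetical helper: return the list of codes, or None if the format is wrong."""
    prefix = "start:"
    if not line.startswith(prefix):
        return None
    # Split at the first ';' and ignore whatever follows it.
    body, sep, _rest = line[len(prefix):].partition(";")
    if not sep:
        return None  # no terminating ';'
    codes = [tok.strip() for tok in body.split(",")]
    for code in codes:
        # A code is an optional leading 'c' followed by at least one digit.
        digits = code[1:] if code.startswith("c") else code
        if not digits.isdigit():
            return None
    return codes

print(parse_codes("start: c12354, c3456, 34526; other stuff that I don't care about"))
# ['c12354', 'c3456', '34526']
```

Note that str.isdigit() also accepts some non-ASCII digit characters, so this check is slightly looser than the regex class [0-9].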

1 Comment

Yeah, I know that I could use straight string parsing, but I would have to write code to verify the string format, whereas with a regex you get that right off the bat.
2

This can be done (pretty elegantly) with a tool like Pyparsing:

from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string

code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue

Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.

3 Comments

Thanks for pitching in with a pyparsing solution! Some other options to consider: define code as Word('c'+string.digits, string.digits); then parser can just be 'start:' + delimitedList(code)("codes") + ';'; the list of codes can be accessed through the results name as codes = result.codes -- in general I would keep the definition of things like code as clean as possible, and not mess them up with things like optional comma delimiters; instead add the commas at the next higher level of parser composition. But your parser certainly gets the job done - congrats!
@Paul: Nice! Didn't know about delimitedList before now, and it totally makes sense that Literal be optional. Great stuff...thanks!
Interesting. I will have to look into pyparsing. Thanks for the post.
0
import re

sstr = re.compile(r'start:([^;]*);')  # captures everything between 'start:' and ';'
slst = re.compile(r'c?(\d+)')         # captures the digits of each code, dropping any leading 'c'

mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = sstr.match(mystr)
if match:
    res = slst.findall(match.group(1))

results in

['12354', '3456', '34526']

1 Comment

Thanks for coding out the answer that madmik3 suggested - very helpful.
