my_string = "Value1=Product Registered;Value2=Linux;Value3=C:5;C++:5;Value4=43;"

I was using the following regex:

tokens = re.findall(r'([^;]+)=([^;]+)', my_string, re.I)

I need to parse Value1, Value2, etc. and store their values in a database. For example, I need to store "C:5;C++:5" for Value3, but with the regex above I only get "C:5", because it treats every ";" as a separator. What would be a better way to do this?

Thanks!

  • Do the fields always start with the string "Value"? Commented Jul 12, 2012 at 23:40
  • This looks like a weird language. If one of the RHSs is e.g. ;Value=foo, it's undecidable. Or is there some constraint on the LHS and RHS? Commented Jul 12, 2012 at 23:42
  • My bad, it is Value4=43. The real problem is Value3. How do I parse that? Commented Jul 12, 2012 at 23:48
  • Is there a reason you have to use a regex for this? If the keys and values can't contain semicolons, you can just do my_string.split(';') (or [kv.split('=') for kv in my_string.split(';')] if you want pairs). If they can contain semicolons, then regexes won't work either. Commented Jul 12, 2012 at 23:54
  • How is any parser supposed to know whether this is "Value3" = "C:5;C++:5" or "Value3" = "C:5" and "C++:5;Value4" = "43"? If you can quantify the answer to that, someone can tell you how to turn that answer into code. If not, the language is ambiguous. Commented Jul 12, 2012 at 23:57
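
To make the ambiguity raised in these comments concrete, here is a minimal sketch of the plain split approach applied to the sample string from the question. The variable name `fields` and the `if kv` filter (which drops the empty piece after the trailing semicolon) are illustrative additions, not part of the original comment.

```python
my_string = "Value1=Product Registered;Value2=Linux;Value3=C:5;C++:5;Value4=43;"

# Naive approach: treat every ';' as a field separator, then split each piece on '='.
# The 'if kv' filter skips the empty string left after the final ';'.
fields = [kv.split('=') for kv in my_string.split(';') if kv]

for field in fields:
    print(field)
```

With this input, the Value3 entry comes out as `['Value3', 'C:5']` and the rest of its value becomes a separate, key-less piece `['C++:5']`, which is the same problem the question describes.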

1 Answer

It seems reasonable to assume that the key names don't contain semicolons. If they can, then as Philipp pointed out the language is ambiguous. But if that assumption holds, you can use a lookahead to tell which ; is a separator: a separating ; has to be followed by a run of characters that are neither ; nor =, and then either an = or the end of the string. Note that the pattern below also assumes the values don't contain =:

>>> import re
>>> my_string = "Value1=Product Registered;Value2=Linux;Value3=C:5;C++:5;Value4=43;"
>>> r = re.compile(r'([^;]+)=([^=]+);(?=[^;=]*(?:=|$))')
>>> r.findall(my_string)
[('Value1', 'Product Registered'),
 ('Value2', 'Linux'),
 ('Value3', 'C:5;C++:5'),
 ('Value4', '43')]
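
Since the goal in the question is to store each value under its key, one possible follow-up is to turn the matched pairs into a dictionary. This is only a sketch: `PAIR_RE` and `parse_pairs` are hypothetical names, and it relies on the same assumptions as the regex above (keys contain no `;` or `=`, values contain no `=`, and every pair, including the last, ends with `;`).

```python
import re

# Same pattern as in the answer; the name PAIR_RE is an illustrative choice.
PAIR_RE = re.compile(r'([^;]+)=([^=]+);(?=[^;=]*(?:=|$))')

def parse_pairs(text):
    """Return a dict mapping each key to its value.

    Assumes keys contain no ';' or '=', values contain no '=',
    and every pair is terminated by ';'.
    """
    return dict(PAIR_RE.findall(text))

my_string = "Value1=Product Registered;Value2=Linux;Value3=C:5;C++:5;Value4=43;"
pairs = parse_pairs(my_string)
print(pairs["Value3"])
```

If a key can repeat, `dict` keeps only the last occurrence, so the list of tuples from `findall` may be the safer structure in that case.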
4 Comments

Unfortunately, as the OP made clear in one of his edits, the values can contain semicolons, which means the language can't be parsed by regex (or, in fact, anything—unless there's some extra rule here, it's ambiguous).
@abarnert By "values" I meant the things like "value1". I'll rephrase, but the code shows that it clearly does work for his example.
OK, right, if the values can contain semicolons but the keys can't, then it's not ambiguous, and it's parseable by your regex. And I think your answer now makes it clear what exactly the assumptions are. So, +1.
@JBernardo Sure; switched it.
