I want to parse a string with a format like this:
[{text1}]{quantity}[{text2}]
This rule means that the string may begin with optional text, followed by a {quantity} whose syntax I describe just below, followed by more optional text.
The {quantity} can take any of the following forms, where {n} is any positive integer:
{n}
{n}PCS
{n}PC
{n}PCS.
{n}PC.
Lot of {n}
Also, it should accept this additional rule:
{n} {text2}
In this rule, {n} is followed by a space, then {text2}.
In the cases where PC or PCS appears:
- it may or may not be followed by a dot
- matching is case insensitive
- a space can optionally appear between {n} and PC or PCS
- the following are all stripped: PC or PCS, the optional dot, and the optional space
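To make the PC/PCS rules concrete, here is a minimal sketch of how they might be written as a single regular expression with Python's standard re module. The name PCS_QUANTITY is hypothetical, and the pattern assumes {n} has no leading zeros and that at most one space separates {n} from PC or PCS. It only matches a {quantity} on its own, not the surrounding text.

```python
import re

# Hypothetical pattern: {n}, optionally followed by an optional space,
# PC or PCS (any case), and an optional dot.
# Assumption: {n} is written without leading zeros.
PCS_QUANTITY = re.compile(r"(?P<n>[1-9]\d*)(?: ?PCS?\.?)?", re.IGNORECASE)

def quantity_value(token):
    """Return {n} as an int if the whole token is a quantity, else None."""
    m = PCS_QUANTITY.fullmatch(token)
    return int(m.group("n")) if m else None
```

Because only the named group n is read back, the PC or PCS, the dot, and the space are dropped without any extra stripping code.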
The desired output is normalized to two variables:
- {n} as an integer
- [{text1}] [{text2}], that is, {text1} (if present) and {text2} (if present) concatenated into one string. A separating space is used only when both pieces are present.
If the {quantity} includes anything besides a positive integer, {n} consists only of the integer, and the rest of the {quantity} (e.g. " PCS.") is stripped from both {n} and the resulting text string.
More integers could appear in the text parts. Any integer other than the one matched as the {quantity} should be treated as part of the text, not interpreted as another quantity.
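The text normalization described above could be sketched as follows. The helper name join_text is hypothetical, and the sketch assumes that a missing piece is represented as either None or an empty string.

```python
def join_text(text1, text2):
    """Join the optional text pieces with one space, skipping missing ones."""
    # Assumption: a missing piece is None, empty, or only whitespace.
    parts = [p.strip() for p in (text1, text2) if p and p.strip()]
    return " ".join(parts)
```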
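One possible rule-based approach, short of lex and yacc, is an ordered list of regular expressions that are tried from most specific to least specific. The sketch below uses only the standard re module. The names QUANTITY_RULES and parse_quantity are hypothetical, and the ordering ("Lot of {n}" first, then {n} with PC or PCS, then the first bare integer) is an assumption, since the rules above do not say which integer wins when several appear.

```python
import re

# Hypothetical rule table, most specific first. Supporting a new
# quantity form means adding one more pattern with a named group "n".
QUANTITY_RULES = [
    # "Lot of {n}"
    re.compile(r"\blot of (?P<n>[1-9]\d*)\b", re.IGNORECASE),
    # "{n}PC", "{n} PCS.", and so on
    re.compile(r"(?<!\d)(?P<n>[1-9]\d*) ?PCS?\b\.?", re.IGNORECASE),
    # Bare "{n}" (also covers "{n} {text2}"); assumed to be the first integer
    re.compile(r"(?<!\d)(?P<n>[1-9]\d*)(?!\d)"),
]

def parse_quantity(s):
    """Return (n, text) for the first rule that matches, or None."""
    for rule in QUANTITY_RULES:
        m = rule.search(s)
        if m:
            # Everything outside the matched quantity is kept as text.
            text1 = s[:m.start()].strip()
            text2 = s[m.end():].strip()
            text = " ".join(p for p in (text1, text2) if p)
            return int(m.group("n")), text
    return None
```

For example, parse_quantity("M8 bolts 10pcs") returns (10, "M8 bolts"): the 8 stays in the text because the PC/PCS rule is tried before the bare-integer rule. With an input such as "M8 bolts 10", the fallback rule would pick 8, so a real solution would need a stated tie-breaking rule for that case.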
I am a former C/C++ programmer. If I had to solve this with those languages, I would probably use rules in lex and yacc, or else I would have to write a lot of nasty code to hand-parse it.
I would like to learn a clean approach for coding this efficiently in Python, probably using rules in some form to easily support more cases. I think I could use lex and yacc with Python, but I wonder if there is an easier way. I'm a Python newbie; I don't even know where to start with this.
I am not asking anyone to write code for a complete solution. Rather, I need an approach or two, and perhaps some sample code showing part of how to do it.