Complex parsing of a string in Python

Question

I want to parse a string with a format like this:

[{text1}]{quantity}[{text2}]

This rule means that at the beginning there is some text that can optionally be there or not, followed by a {quantity} whose syntax I describe just below, followed by more optional text.

The {quantity} can take a variety of forms, with {n} being any positive integer:

{n}
{n}PCS
{n}PC
{n}PCS.
{n}PC.
Lot of {n}

Also, it should accept this additional rule:

{n} {text2}

In this rule, {n} is followed by a space then {text2}

In the cases where PC or PCS appears:

it may or may not be followed by a dot
the match is case insensitive
a space can optionally appear between {n} and PCS
the following are all stripped: PC or PCS, the optional dot, and the optional space

The desired output is normalized to two variables:

{n} as an integer
[{text1}] [{text2}], that is, first {text1} (if present), then a space, then {text2} (if present), concatenated to one string. A space to separate the text pieces is only used if there are two of them.

If the {quantity} includes anything besides a positive integer, {n} consists only of the integer, and the rest of {quantity} (e.g. " PCS.") is stripped from both {n} and the resultant text string.

In the text parts, more integers could appear. Any other than the {quantity} found should be regarded as just part of the text, not interpreted as another quantity.

I am a former C/C++ programmer. If I had to solve this with those languages, I would probably use rules in lex and yacc, or else I would have to write a lot of nasty code to hand-parse it.

I would like to learn a clean approach for coding this efficiently in Python, probably using rules in some form to easily support more cases. I think I could use lex and yacc with Python, but I wonder if there is an easier way. I'm a Python newbie; I don't even know where to start with this.

I am not asking anyone to write code for a complete solution. Rather, I need an approach or two, and perhaps some sample code showing part of how to do it.

The first question to figure out is whether your language is context free or not. That will determine whether you can use a regex or a similar tool. If you can't, then honestly yacc is the default tool for the job. There may be a Python-specific package for yacc, but the original works just as well :-)
– intentionally-left-nil, Commented Jun 15, 2016 at 19:49
PLY is the lex-yacc of Python, but pyparsing may be simpler to get out of the gate.
– PaulMcG, Commented Jun 15, 2016 at 19:51
I think it is context free, in the sense that there will be isolated one-line items that are analyzed separately. So you are suggesting regex for context-free data lines?
– Mark Colan, Commented Jun 15, 2016 at 19:53

PaulMcG · Accepted Answer · 2016-06-16 00:06:16Z

3

Pyparsing lets you build up a parser by stitching together smaller parsers using '+' and '|' operators (among others). You can also attach names to the individual elements in the parser, to make it easier to get at the values afterward.

from pyparsing import (pyparsing_common, CaselessKeyword, FollowedBy, Optional,
    White, oneOf, restOfLine, SkipTo)

int_qty = pyparsing_common.integer

# compose an expression for the quantity, in its various forms
"""
{n}
{n}PCS
{n}PC
{n}PCS.
{n}PC.
Lot of {n}
"""
LOT = CaselessKeyword("lot")
OF = CaselessKeyword("of")
pieces = oneOf("PC PCS PC. PCS.", caseless=True)
qty_expr = Optional(LOT + OF).suppress() + int_qty("qty") + Optional(pieces).suppress()

# compose expression for entire line
line_expr = SkipTo(qty_expr)("text1") + qty_expr + restOfLine("text2")

tests = """
    Send me 1000 widgets pronto!
    Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
    My shipment was short by 25 pcs.
    """

line_expr.runTests(tests)

Prints:

Send me 1000 widgets pronto!
['Send me', 1000, ' widgets pronto!']
- qty: 1000
- text1: ['Send me']
- text2:  widgets pronto!


Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
['Deliver a ', 50, ' barrels of maple syrup by Monday, June 10.']
- qty: 50
- text1: ['Deliver a ']
- text2:  barrels of maple syrup by Monday, June 10.


My shipment was short by 25 pcs.
['My shipment was short by', 25, '']
- qty: 25
- text1: ['My shipment was short by']
- text2:

EDIT: Pyparsing supports two forms of alternatives for matching: MatchFirst, which stops on the first matched alternative (defined using the '|' operator), and Or, which evaluates all alternatives and selects the longest match (defined using the '^' operator). So if one form of the quantity expression should take priority over another, you define that priority explicitly by listing it first:

qty_pcs_expr = int_qty("qty") + White().suppress() + pieces.suppress()
qty_expr = Optional(LOT + OF).suppress() + int_qty("qty") + FollowedBy(White())

# compose expression for entire line
line_expr = (SkipTo(qty_pcs_expr)("text1") + qty_pcs_expr + restOfLine("text2") |
             SkipTo(qty_expr)("text1") + qty_expr + restOfLine("text2"))

Here are the new tests:

tests = """
    Send me 1000 widgets pronto!
    Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
    My shipment was short by 25 pcs.
    2. I expect 22 pcs delivered in the morning
    On May 15 please deliver 1000 PCS.
    """

Giving:

2. I expect 22 pcs delivered in the morning
['2. I expect ', 22, ' delivered in the morning']
- qty: 22
- text1: ['2. I expect ']
- text2:  delivered in the morning


On May 15 please deliver 1000 PCS.
['On May 15 please deliver ', 1000, '']
- qty: 1000
- text1: ['On May 15 please deliver ']
- text2:
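
The difference between the two alternation operators can be seen in isolation with a minimal sketch. The literals "PC" and "PCS" are chosen only for illustration and are deliberately listed shortest first, so that '|' stops early; they are not part of the expressions above.

```python
from pyparsing import Literal

pc = Literal("PC")
pcs = Literal("PCS")

# '|' builds a MatchFirst: alternatives are tried left to right and the
# first one that matches wins, even if a later one would match more text.
match_first = pc | pcs

# '^' builds an Or: every alternative is tried and the longest match wins.
longest = pc ^ pcs

first_tokens = match_first.parseString("PCS").asList()
longest_tokens = longest.parseString("PCS").asList()
print(first_tokens)    # the shorter "PC" alternative matched first
print(longest_tokens)  # the longer "PCS" alternative was selected
```

With '|', ordering the alternatives is how you express priority, which is why the line expression above lists the PCS form before the bare-integer form.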

edited Jun 16, 2016 at 0:06

answered Jun 15, 2016 at 20:04


4 Comments

Mark Colan Over a year ago

Wondering: what happens with a string that did not fit my rules? Is there some way to throw an exception?

mtadd Over a year ago

What a great declarative API. So much more readable than the regex solution I developed.

PaulMcG Over a year ago

@MarkColan - yes, pyparsing will raise a ParseException if there is no match to the input string.

Mark Colan Over a year ago

I have a problem that will require manually editing some records. An integer should be taken as a quantity only when followed by a space. In your example, a solo integer (i.e., not the PCS rules) is taken as the quantity even when there is no space after it. How can I fix that? Also, is there a way to prioritize the {quantity} rules, such as always choosing the "100 PCs" or "100PCS" form if present, even if it is at the end?
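
On the ParseException point raised in the comments above, here is a minimal sketch of catching a failed match. The `number` expression is only a hypothetical stand-in for the answer's `line_expr`, and `try_parse` is an illustrative helper name, not part of pyparsing.

```python
from pyparsing import ParseException, Word, nums

# Hypothetical stand-in for the answer's line_expr: just an integer.
number = Word(nums)

def try_parse(text):
    """Return the parsed tokens, or None when the text does not match."""
    try:
        return number.parseString(text).asList()
    except ParseException as exc:
        # The exception message describes where the match failed.
        print("no match:", exc)
        return None

good = try_parse("42")
bad = try_parse("no digits here")
```

Returning None is just one choice; letting the exception propagate would work equally well if unmatched records should stop processing.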

Jeremy · Accepted Answer · 2016-06-15 20:13:56Z

1

I don't know if you want to use re, but here's a regular expression which I think works. You can change the text value to test it. The match yields a tuple holding the three values [{text1}]{quantity}[{text2}]. The first and last items in the tuple will be empty if text1 and text2 are missing.

import re

text = "aSOETIHSIBSROG1PCS.ecsrGIR"

# Three groups: optional letters, the quantity, optional letters.
# \d+ accepts integers of more than one digit.
match = re.search(r'([a-zA-Z]+|)(\d+PCS?\.?|Lot of \d+)([a-zA-Z]+|)', text)
print(match.groups())

#Output
('aSOETIHSIBSROG', '1PCS.', 'ecsrGIR')
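
The question asks for {n} as an integer and the two text pieces joined into one string, so the tuple above still needs a normalization step. A rough sketch of that step follows; `normalize` is a hypothetical helper name, and it assumes the quantity group contains exactly one run of digits.

```python
import re

def normalize(text1, quantity, text2):
    """Reduce a matched quantity such as '1PCS.' to an int and join the text."""
    # Keep only the digits; "PCS.", "Lot of" and similar are discarded.
    n = int(re.search(r'\d+', quantity).group())
    # Join only the non-empty pieces, so a space appears only between two pieces.
    parts = [p.strip() for p in (text1, text2)]
    joined = " ".join(p for p in parts if p)
    return n, joined

result = normalize('aSOETIHSIBSROG', '1PCS.', 'ecsrGIR')
print(result)
```

For the sample string this gives the integer 1 and the text 'aSOETIHSIBSROG ecsrGIR'.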

answered Jun 15, 2016 at 20:13


mtadd · Accepted Answer · 2016-06-15 20:47:15Z

Here's a rules processor using regex to match your two cases. I create a custom match result class to hold relevant extracted values from the input string. The rules processor tries the following rules in succession:

rule1 - tries to match {n} followed by one of pc, pc., pcs, or pcs.
rule2 - tries to match {n} prefaced by "lot of"
rule3 - matches {n} followed by {text2}

Running the code below on the test strings prints:

abc 23 PCS. def
amount=23 qtype=PCS. text1="abc" text2="def" rule=1
abc 23pc def
amount=23 qtype=pc text1="abc" text2="def" rule=1
abc 24pc.def
amount=24 qtype=pc. text1="abc" text2="def" rule=1
abc 24 pcs def
amount=24 qtype=pcs text1="abc" text2="def" rule=1
abc lot of 24 def
amount=24 qtype=lot of text1="abc" text2="def" rule=2
3 abcs
amount=3 qtype=None text1="" text2="abcs" rule=3

import re

class Match:
    def __init__(self, amount, qtype, text1, text2, rule):
        self.amount = int(amount)
        self.qtype = qtype
        self.text1 = text1
        self.text2 = text2
        self.rule = rule

    def __str__(self):
        return 'amount={} qtype={} text1="{}" text2="{}" rule={}'.format(
            self.amount, self.qtype, self.text1, self.text2, self.rule)

#{n} pc pc. pcs pcs.
def rule1(s):
    m = re.search(r"\s*(?P<amount>\d+)\s*(?P<qtype>PCS?\.?)\s*", s, re.IGNORECASE)
    if m:
        return Match(m.group('amount'), m.group('qtype'),
                     text1=s[:m.start()], text2=s[m.end():], rule=1)
    return None

#lot of {n}
def rule2(s):
    m = re.search(r"\s*lot of\s*(?P<amount>\d+)\s*", s, re.IGNORECASE)
    if m:
        return Match(m.group('amount'), 'lot of',
                     text1=s[:m.start()], text2=s[m.end():], rule=2)
    return None

#{n} {text2}
def rule3(s):
    m = re.search(r"\s*(?P<amount>\d+)\s*", s)
    if m:
        return Match(m.group('amount'), None,
                     text1=s[:m.start()], text2=s[m.end():], rule=3)
    return None

RULES = [rule1, rule2, rule3]

def process(s):
    for rule in RULES:
        m = rule(s)
        if m: return m
    return None


tests = [
"abc 23 PCS. def",
"abc 23pc def",
"abc 24pc.def",
"abc 24 pcs def",
"abc lot of 24 def",
"3 abcs"
]


for t in tests:
    m = process(t)
    print(t)
    print(m)
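
Since the question asks for rules that are easy to extend, the same idea can also be sketched as a table of compiled patterns instead of one function per rule. This is an illustrative variant, not a replacement for the code above: `RULE_TABLE` and `parse_line` are hypothetical names, the result is reduced to the integer plus the joined text that the question asks for, and raising ValueError on unmatched input is an assumption about how such lines should be handled.

```python
import re

# Hypothetical rule table: (rule number, compiled pattern), tried in order.
# Each pattern consumes the whitespace around the quantity it matches.
RULE_TABLE = [
    (1, re.compile(r"\s*(?P<amount>\d+)\s*PCS?\.?\s*", re.IGNORECASE)),
    (2, re.compile(r"\s*lot of\s*(?P<amount>\d+)\s*", re.IGNORECASE)),
    (3, re.compile(r"\s*(?P<amount>\d+)\s*")),
]

def parse_line(s):
    """Return (amount, text, rule number) for the first rule that matches."""
    for number, pattern in RULE_TABLE:
        m = pattern.search(s)
        if m:
            # Text before and after the quantity; drop empty pieces so a
            # separating space is used only when both pieces are present.
            pieces = [s[:m.start()], s[m.end():]]
            text = " ".join(p for p in pieces if p)
            return int(m.group("amount")), text, number
    raise ValueError("no quantity found in {!r}".format(s))

print(parse_line("abc 23 PCS. def"))
print(parse_line("3 abcs"))
```

Adding a new quantity form then means appending one entry to the table, with its position deciding its priority.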
