Parsing semi structured text strings in Python

Question

I am trying to parse pseudo-English scripts and want to convert them into another machine-readable language. However, the scripts have been written by many people in the past, and each had their own style of writing.

Some examples would be:

On Device 1 Set word 45 and 46 to hex 331
On Device 1 set words 45 and 46 bits 3..7 to 280
on Device 1 set word 45 to oct 332
on device 1 set speed to 60kts Words 3-4 to hex 34

(There are many more variations in the source text.)

The issue is that the syntax is not always logical or consistent.

I have looked at regular expressions and at matching certain words. This works out OK, but it gets awkward when I need to know the next word (e.g. in 'Word 24' I would match 'Word' and then try to figure out whether the next token is a number). In the case of 'Words' I need to find the words to set, as well as their values.

In example 1, it should produce 'Set word 45 to hex 331' and 'Set word 46 to hex 331' or, if possible, 'Set word 45 to hex 331 and word 46 to hex 331'.

I tried using the findall method of re, but that only gives me the matched words, and then I have to find the next word (i.e. the value) manually.

Alternatively, I could split the string on spaces and process each word manually, and then be able to do something like the following.

Assuming the list sp is:

['On', 'device1:', 'set', 'Word', '1', '', 'to', '88', 'and', 'word', '2', 'to', '2151']

for i in range (0,sp.__len__()):
    rew = re.search("[Ww]ord", sp[i])
    if rew:
        print ("Found word, next val is ", sp[i+1])

Is there a better way to do what I want? I looked a little into tokenizing, but I am not sure that would work, as the language is not structured in the first place.

Try to avoid calling dunder methods directly; use len(sp) instead.
– Paul Evans, Commented Mar 1, 2019 at 11:30
If the wording is otherwise the same, you can convert everything to lowercase or uppercase. It will help.
– Akhilesh_IN, Commented Mar 1, 2019 at 11:34

Michael Dyck · Accepted Answer · 2019-03-01 16:58:54Z

I suggest you develop a program that gradually explores the syntax that people have used to write the scripts.

E.g., each instruction in your examples seems to break down into a device-part and a settings-part. So you could try matching each line against the regex ^(.+) set (.+). If you find lines that don't match that pattern, print them out. Examine the output, find a general pattern that matches some of them, add a corresponding regex to your program (or modify an existing regex), and repeat. Proceed until you've recognized (in a very general way) every line in your input.
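
A minimal sketch of that explore-and-refine loop, assuming the scripts are available as a list of lines. The names PATTERNS and unmatched_lines and the third sample line are made up for illustration; they are not from the original scripts.

```python
import re

# Hypothetical starting set of patterns; extend it as new styles turn up.
# Case-insensitive, because capitalization varies in the examples.
PATTERNS = [
    re.compile(r"^(.+) set (.+)", re.IGNORECASE),
]

def unmatched_lines(lines):
    """Return the lines that none of the known patterns recognize."""
    return [line for line in lines
            if not any(p.match(line) for p in PATTERNS)]

sample = [
    "On Device 1 Set word 45 and 46 to hex 331",
    "on Device 1 set word 45 to oct 332",
    "Device 2: word 7 = 12",  # made-up line in a style not yet covered
]

# Print whatever is not recognized yet, inspect it, then add or adjust a pattern.
for line in unmatched_lines(sample):
    print("UNMATCHED:", line)
```

Each pass through the data should shrink the list of unmatched lines; when it is empty, every line has been recognized at this level.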

(Since capitalization appears to be inconsistent, you can either do case-insensitive matches, or convert each line to lowercase before you start processing it. More generally, you may find other 'normalizations' that simplify subsequent processing. E.g., if people were inconsistent about spaces, you can convert every run of whitespace characters into a single space.)
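
One possible normalization step, as a sketch. The function name normalize is an assumption, and lowercasing everything is only safe if case carries no meaning in the scripts.

```python
import re

def normalize(line):
    """Lowercase a line and collapse each run of whitespace into one space."""
    return re.sub(r"\s+", " ", line.strip()).lower()

print(normalize("On  Device 1\tSet   Word 45 to HEX 331 "))
# prints: on device 1 set word 45 to hex 331
```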

(If your input has typographical errors, e.g. someone wrote "ste" for "set", then you can either change the regex to allow for that (... (set|ste) ...), or go to (a copy of) the input file and just fix the typo.)

Then go back to the lines that matched ^(.+) set (.+), print out just the first group for each, and repeat the above process for just those substrings. Then repeat the process for the second group in each "set" instruction. And so on, recursively.
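
The same idea applied one level down might look like the sketch below. DEVICE_RE is only a first guess at the device-part syntax and unrecognized_device_parts is a hypothetical helper; the third sample line uses the 'device1:' spelling from the token list in the question.

```python
import re
from collections import Counter

SET_RE = re.compile(r"^(.+) set (.+)", re.IGNORECASE)
# Hypothetical first guess at the device-part; refine it against real data.
DEVICE_RE = re.compile(r"^on device (\d+)$", re.IGNORECASE)

def unrecognized_device_parts(lines):
    """Count device-parts (group 1) that the device pattern does not cover."""
    counts = Counter()
    for line in lines:
        m = SET_RE.match(line)
        if m is None:
            continue  # not a "set" instruction; handled at the top level
        device_part = m.group(1).strip()
        if not DEVICE_RE.match(device_part):
            counts[device_part] += 1
    return counts

sample = [
    "On Device 1 Set word 45 and 46 to hex 331",
    "on device 1 set word 45 to oct 332",
    "On device1: set Word 1 to 88",
]
print(unrecognized_device_parts(sample))
```

Counting the leftovers rather than just printing them shows which unrecognized styles are most common, which is a reasonable order in which to handle them.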

Eventually, your program will be, in effect, a parser for the script language. At that point, you can start to add code to convert each recognized construct into the output language.
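
As an illustration of that last step, here is a sketch that converts one recognized settings-part style into the expanded form the question asks for. WORDS_RE covers only the "word(s) N [and M] to [hex|oct] VALUE" style, and the output format "Set word N to ..." is assumed from the question, since the real target language is not shown.

```python
import re

# Hypothetical pattern for a single settings-part style.
WORDS_RE = re.compile(
    r"^words? (\d+)(?: and (\d+))? to (?:(hex|oct) )?(\w+)$",
    re.IGNORECASE,
)

def expand_settings(settings):
    """Turn one recognized settings-part into one output command per word."""
    m = WORDS_RE.match(settings)
    if m is None:
        return None  # not recognized yet: report it and extend the patterns
    first, second, base, value = m.groups()
    prefix = f"{base.lower()} " if base else ""
    words = [first] + ([second] if second else [])
    return [f"Set word {w} to {prefix}{value}" for w in words]

print(expand_settings("word 45 and 46 to hex 331"))
# prints: ['Set word 45 to hex 331', 'Set word 46 to hex 331']
```

Returning None for anything unrecognized keeps the conversion code honest: unknown constructs go back into the explore-and-refine loop instead of being silently mistranslated.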

Depending on your experience with Python, you can find ways to make the code concise.

2 Comments

Ammo Over a year ago

Thanks for the input. That's currently what I am doing: I am matching against typical syntax using regexes. I was hoping to see another method of doing this since, as you mention, I have to match all cases, including text in the wrong order, such as "... to 30 hex" or "... to hex 30".

Michael Dyck Over a year ago

Yes, you're using regexes, but it sounded to me like you were using them to just "fish for keywords", not as part of a top-down, recursive, recognize-everything approach such as I suggested. Certainly, if you're familiar with writing grammars, and you have a good parsing tool, then that's another method, but you're still going to have to explicitly deal with "text in the wrong order" -- either allowing for it in the grammar or normalizing it away in a preprocessing step.
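
A sketch of the "normalize it away" option for the word-order case mentioned above. It assumes that a value followed by a base keyword always means the same as the keyword followed by the value; TRAILING_BASE_RE and normalize_base_order are hypothetical names.

```python
import re

# Assumption: "to 30 hex" and "to hex 30" are equivalent in the scripts.
TRAILING_BASE_RE = re.compile(r"\bto (\w+) (hex|oct)\b", re.IGNORECASE)

def normalize_base_order(line):
    """Rewrite '... to 30 hex' as '... to hex 30'; leave other lines unchanged."""
    return TRAILING_BASE_RE.sub(r"to \2 \1", line)

print(normalize_base_order("set word 45 to 30 hex"))  # prints: set word 45 to hex 30
print(normalize_base_order("set word 45 to hex 30"))  # prints: set word 45 to hex 30
```

Running this before any matching means the later patterns only need to handle one ordering.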

Jan · Accepted Answer · 2019-03-01 22:26:17Z

Depending on what you actually want from these strings, you could use a parser, e.g. parsimonious:

from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar

grammar = Grammar(
    r"""
    command     = set operand to? number (operator number)* middle? to? numsys? number
    operand     = (~r"words?" / "speed") ws
    middle      = (~r"[Ww]ords" / "bits")+ ws number
    to          = ws "to" ws
    number      = ws ~r"[-\d.]+" "kts"? ws
    numsys      = ws ("oct" / "hex") ws
    operator    = ws "and" ws
    set         = ~"[Ss]et" ws
    ws          = ~r"\s*"
    """
)

class HorribleStuff(NodeVisitor):
    def __init__(self):
        self.cmds = []

    def generic_visit(self, node, visited_children):
        pass

    def visit_operand(self, node, visited_children):
        self.cmds.append(('operand', node.text))

    def visit_number(self, node, visited_children):
        self.cmds.append(('number', node.text))


examples = ['Set word 45 and 46 to hex 331',
            'set words 45 and 46 bits 3..7 to 280',
            'set word 45 to oct 332',
            'set speed to 60kts Words 3-4 to hex 34']


for example in examples:
    tree = grammar.parse(example)
    hs = HorribleStuff()
    hs.visit(tree)
    print(hs.cmds)

This would yield

[('operand', 'word '), ('number', '45 '), ('number', '46 '), ('number', '331')]
[('operand', 'words '), ('number', '45 '), ('number', '46 '), ('number', '3..7 '), ('number', '280')]
[('operand', 'word '), ('number', '45 '), ('number', '332')]
[('operand', 'speed '), ('number', '60kts '), ('number', '3-4 '), ('number', '34')]
