Python: Parsing logical string into list of lists

Question

I get string of logical expressions from a database and need to put these into a list of lists for further evaluation. I already tried reading a lot about string parsing but could not find the answer so far. For easier understanding of the problem, here are 3 examples:

input_string1 = '((A OR B) AND (C OR D)) OR E'
input_string2 = '(A AND ( B OR C ) AND D AND E)'
input_string3 = ' A OR ( B AND C ) OR D OR E'

expected ouput:

Results_string1=[ ['A', 'C'], ['A','D'], ['B','C'], ['B','D'], ['E']]
Results_string2=[ ['A', 'B', 'D', 'E'], ['A', 'C', 'D', 'E'] ]
Results_string3=[ ['A'], ['B','C'], ['D'], ['E'] ]

So basically I need the fully factorized expressions in terms of OR and put those into the list. This means any AND condition is expressed by having both expressions in the same sublist, while any OR condition triggers the creation of new sublists.

e.g. E AND F --> [E, F], E OR F --> [[E],[F]]

The strings from the database have arbitrary length and an arbitrary amount of brackets.

Anyone got an idea how to define the grammar, such that I can use e.g. the pyparsing package?

The start of the grammar so far is:

import pyparsing as pp
gene_id = pp.Word(pp.alphanums)
logical = ( pp.Keyword("AND") | pp.Keyword("OR") ).setName("logical")
l_brackets = (pp.Literal('(') ).setName('l_brackets')
r_brackets = ( pp.Literal(')') ).setName('r_brackets')

But how do I have to define the real parser?

One of the main problems is that I don't know how to handle the arbitrary occurring brackets and varying length of the string. I've been playing around with the nestedExpr()-parser from the pyparser toolbox but could not create the correct behavior so far.

@tobias_k Or he could use + or * if he's willing to go for the symbols used in Discrete Mathematics. — Sean Francis N. Ballais
– Sean Francis N. Ballais, Commented Jul 8, 2016 at 15:29
So, he is just supposed to simplify the expression. I'm thinking of looping through the expression with a behavior of a lexical analyzer (?). — Sean Francis N. Ballais
– Sean Francis N. Ballais, Commented Jul 8, 2016 at 15:38
Look at the SimpleBool.py example at pyparsing.wikispaces.com/examples — PaulMcG
– PaulMcG, Commented Jul 8, 2016 at 21:53

Alden · Accepted Answer · 2016-07-08 16:56:40Z

Here is a solution that gives the result you desire using a tokenizer and recursive descent parser. Unfortunately, I am not familiar with the pyparsing library so I did not use it.

s1 = '((A OR B) AND (C OR D)) OR E'
s2 = '(A AND ( B OR C ) AND D AND E)'
s3 = ' A OR ( B AND C ) OR D OR E'

class Token:
    def __init__(self, name, value, location):
        self.name = name
        self.value = value
        self.location = location

    def __repr__(self):
        return self.name

def tokenize(s):
    index = 0
    tokens = []
    while index < len(s):
        c = s[index]

        if c == '(':
            tokens.append(Token('LPAREN', '(', index))
            index += 1
            continue
        elif c == ')':
            tokens.append(Token('RPAREN', ')', index))
            index += 1
            continue
        elif s[index:index+2] == 'OR':
            tokens.append(Token('OR', 'OR', index))
            index += 2
            continue
        elif s[index:index+3] == 'AND':
            tokens.append(Token('AND', 'AND', index))
            index += 3
            continue
        else:
            start = index
            while index < len(s) and s[index].isalpha(): 
                index += 1
            if not start == index:
                tokens.append(Token('SYMBOL', s[start:index], start))
            else:
                index += 1

    return tokens

def eval_and(left, right):
    result = []
    for l in left:
        for r in right:
            result.append(l+r)

    return result 

def eval_or(left, right):
    left.extend(right)
    return left

def parse_symbol(tokens, index):
    token = tokens[index]
    if token.name == 'SYMBOL':
        return ([[token.value]], index+1)
    else:
        raise

def parse_paren(tokens, index):
    lparen = tokens[index]
    index += 1
    if not lparen.name == 'LPAREN':
        raise

    result, index = parse_expr(tokens, index)

    rparen = tokens[index]
    index += 1
    if not rparen.name == 'RPAREN':
        raise

    return (result, index)

def parse_expr(tokens, index):
    left = None
    right = None

    token = tokens[index]
    if token.name == 'LPAREN':
        left, index = parse_paren(tokens, index)
    elif token.name == 'SYMBOL':
        left, index = parse_symbol(tokens, index)

    op = tokens[index]
    index += 1
    if not op.name == 'OR' and not op.name == 'AND':
        raise

    while True:
        token = tokens[index]
        if token.name == 'LPAREN':
            right, index = parse_paren(tokens, index)
        elif token.name == 'SYMBOL':
            right, index = parse_symbol(tokens, index)

        op = eval_or if op.name == 'OR' else eval_and
        result = op(left, right)

        continue_ = False
        if not index == len(tokens):
            next_ = tokens[index]
            if next_.name == 'OR' or next_.name == 'AND':
                continue_ = True
                op = next_

        if continue_:
            left = result
            index += 1
            continue
        else:
            return (result, index)

def parse(tokens):
    result = None

    token = tokens[0]
    if token.name == 'LPAREN':
        result, _ = parse_paren(tokens, 0)
    else:
        result, _ = parse_expr(tokens, 0)

    return result

for s in [s1, s2, s3]:
    print parse(tokenize(s))

The output is

[['A', 'C'], ['A', 'D'], ['B', 'C'], ['B', 'D']]
[['A', 'B', 'D', 'E'], ['A', 'C', 'D', 'E']]
[['A'], ['B', 'C'], ['D'], ['E']]

Be careful -- the E is missing from the output of s1. parse(tokenize('(A AND B) OR C')) -> [['A', 'B']]

tobias_k · Accepted Answer · 2016-07-08 16:14:35Z

1

For the recursive nature of the expression, you can use the Forward element, and for the variable length clauses you can use ZeroOrMore. Based on your existing grammar:

expression = pp.Forward()
atom = gene_id | pp.Group(l_brackets + expression + r_brackets)
expression << atom + pp.ZeroOrMore(logical + expression)

With this, expression.parseString results in the following for your input strings:

[['(', ['(', 'A', 'OR', 'B', ')'], 'AND', ['(', 'C', 'OR', 'D', ')'], ')'], 'OR', 'E']
[['(', 'A', 'AND', ['(', 'B', 'OR', 'C', ')'], 'AND', 'D', 'AND', 'E', ')']]
['A', 'OR', ['(', 'B', 'AND', 'C', ')'], 'OR', 'D', 'OR', 'E']

If you want to get rid of the ( and ) in the output, you should call suppress() on l_bracket and r_bracket. Given the grouping (with Group) those are not really needed. Output will then be, e.g., [['A', 'AND', ['B', 'OR', 'C'], 'AND', 'D', 'AND', 'E']] for your second string.

The conversion to DNF is another matter and should best be asked in another question.

answered Jul 8, 2016 at 16:14

tobias_k

83.1k12 gold badges130 silver badges186 bronze badges

2 Comments

KillTech Over a year ago

thank you very much, this looks really great and I'll be trying this as soon as i can! with this output, it should be easy to create the lists I'm looking for.

PaulMcG Over a year ago

Read up on pyparsing's infixNotation method to help you handle proper precedence of operations. Also see the simpleBool.py example at the pyparsing wiki.

Collectives™ on Stack Overflow

Python: Parsing logical string into list of lists

2 Answers 2

2 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related