2

I get string of logical expressions from a database and need to put these into a list of lists for further evaluation. I already tried reading a lot about string parsing but could not find the answer so far. For easier understanding of the problem, here are 3 examples:

input_string1 = '((A OR B) AND (C OR D)) OR E'
input_string2 = '(A AND ( B OR C ) AND D AND E)'
input_string3 = ' A OR ( B AND C ) OR D OR E'

expected ouput:

Results_string1=[ ['A', 'C'], ['A','D'], ['B','C'], ['B','D'], ['E']]
Results_string2=[ ['A', 'B', 'D', 'E'], ['A', 'C', 'D', 'E'] ]
Results_string3=[ ['A'], ['B','C'], ['D'], ['E'] ]

So basically I need the fully factorized expressions in terms of OR and put those into the list. This means any AND condition is expressed by having both expressions in the same sublist, while any OR condition triggers the creation of new sublists.

e.g. E AND F --> [E, F], E OR F --> [[E],[F]]

The strings from the database have arbitrary length and an arbitrary amount of brackets.

Anyone got an idea how to define the grammar, such that I can use e.g. the pyparsing package?

The start of the grammar so far is:

import pyparsing as pp
gene_id = pp.Word(pp.alphanums)
logical = ( pp.Keyword("AND") | pp.Keyword("OR") ).setName("logical")
l_brackets = (pp.Literal('(') ).setName('l_brackets')
r_brackets = ( pp.Literal(')') ).setName('r_brackets')

But how do I have to define the real parser?

One of the main problems is that I don't know how to handle the arbitrary occurring brackets and varying length of the string. I've been playing around with the nestedExpr()-parser from the pyparser toolbox but could not create the correct behavior so far.

4
  • 1
    Please show us what you tried so far. Commented Jul 8, 2016 at 15:10
  • @tobias_k Or he could use + or * if he's willing to go for the symbols used in Discrete Mathematics. Commented Jul 8, 2016 at 15:29
  • So, he is just supposed to simplify the expression. I'm thinking of looping through the expression with a behavior of a lexical analyzer (?). Commented Jul 8, 2016 at 15:38
  • Look at the SimpleBool.py example at pyparsing.wikispaces.com/examples Commented Jul 8, 2016 at 21:53

2 Answers 2

3

Here is a solution that gives the result you desire using a tokenizer and recursive descent parser. Unfortunately, I am not familiar with the pyparsing library so I did not use it.

s1 = '((A OR B) AND (C OR D)) OR E'
s2 = '(A AND ( B OR C ) AND D AND E)'
s3 = ' A OR ( B AND C ) OR D OR E'

class Token:
    def __init__(self, name, value, location):
        self.name = name
        self.value = value
        self.location = location

    def __repr__(self):
        return self.name

def tokenize(s):
    index = 0
    tokens = []
    while index < len(s):
        c = s[index]

        if c == '(':
            tokens.append(Token('LPAREN', '(', index))
            index += 1
            continue
        elif c == ')':
            tokens.append(Token('RPAREN', ')', index))
            index += 1
            continue
        elif s[index:index+2] == 'OR':
            tokens.append(Token('OR', 'OR', index))
            index += 2
            continue
        elif s[index:index+3] == 'AND':
            tokens.append(Token('AND', 'AND', index))
            index += 3
            continue
        else:
            start = index
            while index < len(s) and s[index].isalpha(): 
                index += 1
            if not start == index:
                tokens.append(Token('SYMBOL', s[start:index], start))
            else:
                index += 1

    return tokens

def eval_and(left, right):
    result = []
    for l in left:
        for r in right:
            result.append(l+r)

    return result 

def eval_or(left, right):
    left.extend(right)
    return left

def parse_symbol(tokens, index):
    token = tokens[index]
    if token.name == 'SYMBOL':
        return ([[token.value]], index+1)
    else:
        raise

def parse_paren(tokens, index):
    lparen = tokens[index]
    index += 1
    if not lparen.name == 'LPAREN':
        raise

    result, index = parse_expr(tokens, index)

    rparen = tokens[index]
    index += 1
    if not rparen.name == 'RPAREN':
        raise

    return (result, index)

def parse_expr(tokens, index):
    left = None
    right = None

    token = tokens[index]
    if token.name == 'LPAREN':
        left, index = parse_paren(tokens, index)
    elif token.name == 'SYMBOL':
        left, index = parse_symbol(tokens, index)

    op = tokens[index]
    index += 1
    if not op.name == 'OR' and not op.name == 'AND':
        raise

    while True:
        token = tokens[index]
        if token.name == 'LPAREN':
            right, index = parse_paren(tokens, index)
        elif token.name == 'SYMBOL':
            right, index = parse_symbol(tokens, index)

        op = eval_or if op.name == 'OR' else eval_and
        result = op(left, right)

        continue_ = False
        if not index == len(tokens):
            next_ = tokens[index]
            if next_.name == 'OR' or next_.name == 'AND':
                continue_ = True
                op = next_

        if continue_:
            left = result
            index += 1
            continue
        else:
            return (result, index)

def parse(tokens):
    result = None

    token = tokens[0]
    if token.name == 'LPAREN':
        result, _ = parse_paren(tokens, 0)
    else:
        result, _ = parse_expr(tokens, 0)

    return result

for s in [s1, s2, s3]:
    print parse(tokenize(s))

The output is

[['A', 'C'], ['A', 'D'], ['B', 'C'], ['B', 'D']]
[['A', 'B', 'D', 'E'], ['A', 'C', 'D', 'E']]
[['A'], ['B', 'C'], ['D'], ['E']]
Sign up to request clarification or add additional context in comments.

2 Comments

WOW! Thank you very much. This is exactly what i wanted. :)
Be careful -- the E is missing from the output of s1. parse(tokenize('(A AND B) OR C')) -> [['A', 'B']]
1

For the recursive nature of the expression, you can use the Forward element, and for the variable length clauses you can use ZeroOrMore. Based on your existing grammar:

expression = pp.Forward()
atom = gene_id | pp.Group(l_brackets + expression + r_brackets)
expression << atom + pp.ZeroOrMore(logical + expression)

With this, expression.parseString results in the following for your input strings:

[['(', ['(', 'A', 'OR', 'B', ')'], 'AND', ['(', 'C', 'OR', 'D', ')'], ')'], 'OR', 'E']
[['(', 'A', 'AND', ['(', 'B', 'OR', 'C', ')'], 'AND', 'D', 'AND', 'E', ')']]
['A', 'OR', ['(', 'B', 'AND', 'C', ')'], 'OR', 'D', 'OR', 'E']

If you want to get rid of the ( and ) in the output, you should call suppress() on l_bracket and r_bracket. Given the grouping (with Group) those are not really needed. Output will then be, e.g., [['A', 'AND', ['B', 'OR', 'C'], 'AND', 'D', 'AND', 'E']] for your second string.

The conversion to DNF is another matter and should best be asked in another question.

2 Comments

thank you very much, this looks really great and I'll be trying this as soon as i can! with this output, it should be easy to create the lists I'm looking for.
Read up on pyparsing's infixNotation method to help you handle proper precedence of operations. Also see the simpleBool.py example at the pyparsing wiki.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.