recursive nested expression in Python

Question

I am using Python 2.6.4.

I have a series of select statements in a text file and I need to extract the field names from each select query. This would be easy if some of the fields didn't use nested functions like to_char() etc.

Given select statement fields that may contain several nested parentheses, such as "ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name,", or the simple case of just "base_field_name" as a field, is it possible to use Python's re module to write a regex that extracts base_field_name? If so, what would the regex look like?

Alex Martelli · Accepted Answer · 2010-02-01 00:31:35Z

11

Regular expressions are not suitable for parsing "nested" structures. Try, instead, a full-fledged parsing kit such as pyparsing -- examples of using pyparsing specifically to parse SQL can be found here and here, for example (you'll no doubt need to take the examples just as a starting point and write some parsing code of your own, but it's definitely not too difficult).

answered Feb 1, 2010 at 0:31

Alex Martelli



2 Comments

Agos Over a year ago

+1 for reminding the world that well-parenthesized expressions (well, all Chomsky Type-2 languages) need more than a regexp to be properly parsed :)

sebres Over a year ago

You are not correct; they are suitable for that (Python just doesn't support it yet). A pure PCRE regex matching nested paired parentheses would look like ^(?P<pn>\((?P<v>((?>[^()]+)|(?P>pn))*)\))$, and it will match ((1+2)*3), with the matched group v containing (1+2)*3.

Attila O. · Accepted Answer · 2010-02-01 00:55:48Z

2

>>> import re
>>> string = 'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name'
>>> rx = re.compile(r'^(.*?\()*(.+?)(,.*?)*(,|\).*?)*$')
>>> rx.search(string).group(2)
'base_field_name'
>>> rx.search('base_field_name').group(2)
'base_field_name'

edited Feb 1, 2010 at 0:55

answered Feb 1, 2010 at 0:38

Attila O.


2 Comments

Attila O. Over a year ago

PS: as Alex Martelli said, you should use a real parser here. Anyway, if you only want a quick regex that just works, you can use this. But you should really use a parser, as this regex looks rather ugly :)

TheObserver Over a year ago

I'm not after something that looks pretty since it's a one off to get me the data I want so I can do other things with it. :) But thanks, my regex is rusty and I figured someone might know better.

just somebody · Accepted Answer · 2010-02-01 00:40:17Z

2

Either a table-driven parser as Alex Martelli suggests or a hand-written recursive descent parser. They're not hard and quite rewarding to write.
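To give an idea of how small a hand-written recursive descent parser can be, here is a minimal sketch. It is not taken from any answer in this thread; the names tokenize, parse_expr and base_field_name are made up for illustration. It assumes fields consist only of identifiers, parentheses and commas (no string literals, operators or zero-argument calls), and that the base field is always the first argument of each call.

```python
import re

# One token is an identifier or one of the punctuation characters ( ) ,
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z_0-9]*|[(),])")

def tokenize(text):
    """Split a field expression into identifier and punctuation tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def peek(tokens, i):
    """Return the token at index i, or None past the end."""
    return tokens[i] if i < len(tokens) else None

def parse_expr(tokens, i):
    """Grammar: expr := NAME [ '(' expr { ',' expr } ')' ]

    Returns (base_name, index_of_next_token).
    """
    name = peek(tokens, i)
    if name is None or name in ("(", ")", ","):
        raise ValueError("expected a name at token %d" % i)
    i += 1
    if peek(tokens, i) != "(":
        return name, i                      # plain field name
    base, i = parse_expr(tokens, i + 1)     # first argument holds the base field
    while peek(tokens, i) == ",":
        _, i = parse_expr(tokens, i + 1)    # parse and discard other arguments
    if peek(tokens, i) != ")":
        raise ValueError("expected ')' at token %d" % i)
    return base, i + 1

def base_field_name(field):
    """Return the innermost first-argument name; the alias and comma are ignored."""
    base, _ = parse_expr(tokenize(field), 0)
    return base

print(base_field_name("ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name,"))
```

Because the recursion follows the nesting, other arguments may themselves contain nested calls without confusing the result, which is where the flat regex approaches tend to break down.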

answered Feb 1, 2010 at 0:40

just somebody



mtmk · Accepted Answer · 2010-02-01 01:22:23Z

1

This may be good enough:

import re
# Prints "field_name, format": the group captures the whole innermost argument list.
print(re.match(r".*\(([^\)]+)\)", "ltrim(to_char(field_name, format))").group(1))

You would need to do further processing. For example, pick up the function name as well and pull the field name according to the function's signature.

.*(\w+)\(([^\)]+)\)
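One caveat with the pattern above: the leading greedy .* leaves only the last character of the function name for (\w+) (here "r" from to_char). A possible variant, shown only as a sketch, drops the .* and uses re.search with a character class that excludes both kinds of parenthesis, so the match lands on the leftmost innermost call. The helper name innermost_call is hypothetical.

```python
import re

# A call whose argument list contains no parentheses, i.e. an innermost call.
CALL = re.compile(r"(\w+)\(([^()]*)\)")

def innermost_call(expr):
    """Return (function_name, [arguments]) for the leftmost innermost call, or None."""
    m = CALL.search(expr)
    if m is None:
        return None          # e.g. a plain field name with no function call
    args = [a.strip() for a in m.group(2).split(",")]
    return m.group(1), args

print(innermost_call("ltrim(to_char(field_name, format))"))
# ('to_char', ['field_name', 'format'])
```

Which argument is the field then depends on the function, so a per-function lookup of argument positions would still be needed on top of this.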

edited Feb 1, 2010 at 1:22

answered Feb 1, 2010 at 0:44

mtmk


2 Comments

Attila O. Over a year ago

this prints 'field_name, format', not 'field_name' for me, and also doesn't work for the simple string 'field_name'.

mtmk Over a year ago

How do you know every function is going to accept same arguments?

user97370 · Accepted Answer · 2010-02-01 03:07:42Z

1

Here's a really hacky parser that does what you want.

It works by calling 'eval' on the text to be parsed, mapping all identifiers to a function which returns its first argument (which I'm guessing is what you want given your example).

class FakeFunction(object):
    def __init__(self, name):
        self.name = name
    def __call__(self, *args):
        return args[0]
    def __str__(self):
        return self.name

class FakeGlobals(dict):
    def __getitem__(self, x):
        return FakeFunction(x)

def ExtractBaseFieldName(x):
    return eval(x, FakeGlobals())

# Caution: eval executes the text, so only use this on input you trust.
print(ExtractBaseFieldName('ltrim(rtrim(to_char(base_field_name, format)))'))
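
A variant of the same idea that avoids executing the text, offered as a sketch rather than as this answer's method: the example field happens to be valid Python call syntax, so the standard ast module can parse it and the tree can be walked down through the first argument of each call. The function name extract_base_field_name is hypothetical, and the sketch assumes the alias and trailing comma have already been stripped (they would be a syntax error here).

```python
import ast

def extract_base_field_name(expr):
    """Follow the first argument of each nested call down to a bare name."""
    node = ast.parse(expr.strip(), mode="eval").body
    while isinstance(node, ast.Call):
        if not node.args:
            raise ValueError("call without arguments in %r" % expr)
        node = node.args[0]      # same rule as FakeFunction: keep the first argument
    if not isinstance(node, ast.Name):
        raise ValueError("innermost first argument is not a plain name in %r" % expr)
    return node.id

print(extract_base_field_name("ltrim(rtrim(to_char(base_field_name, format)))"))
```

SQL that is not also valid Python (for example the || operator) will raise SyntaxError, so this only covers the simple function-call shape from the question.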

answered Feb 1, 2010 at 3:07

user97370


Ricardo Cárdenes · Accepted Answer · 2010-02-01 01:25:32Z

0

Do you really need regular expressions? To get the one you've got up there I'd use

  s[s.rfind('(')+1:s.find(')')].split(',')[0]

with 's' containing the original string.

Of course, it's not a general solution, but...
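
One gap in the one-liner is the plain "base_field_name" case: with no parentheses, rfind and find both return -1, and the slice drops the last character. A guarded version might look like the sketch below; the function name base_field is made up for illustration, and it keeps the same assumption that the field of interest is the first argument after the last opening parenthesis.

```python
def base_field(s):
    """String-only extraction of the first argument of the last-opened call."""
    start = s.rfind("(")
    if start == -1:
        return s.strip()                 # no function call: the field is the whole string
    end = s.find(")", start)             # first ')' after the last '('
    if end == -1:
        raise ValueError("unbalanced parentheses in %r" % s)
    return s[start + 1:end].split(",")[0].strip()

print(base_field("ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name"))
print(base_field("base_field_name"))
```

It is still not general: for something like nvl(a, trim(b)) it returns b, because it always looks at the last opening parenthesis.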

answered Feb 1, 2010 at 1:25

Ricardo Cárdenes


3 Comments

Attila O. Over a year ago

A compiled regex should be much faster than this. Well, I guess we're not in a hurry, but still, just for the sake of efficiency.

Ricardo Cárdenes Over a year ago

You may find that working directly with strings is faster. It depends heavily on the regex and on the complexity of the equivalent code you need to write to do the same thing without one. Actually, have you tried comparing the two?

Ricardo Cárdenes Over a year ago

Oh, and in case I'd want to go for the equivalent regexp, I'd use "\(([^(),]+),", which is slightly faster than the purely string-based one. Both of them are one order of magnitude faster than your regexp...
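
For anyone who wants to repeat the comparison, a small timeit harness along these lines should do. It is only a sketch: the function names and the iteration count are arbitrary choices, and the timings will vary by machine and Python version, so no numbers are claimed here.

```python
import re
import timeit

s = "ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name"
rx = re.compile(r"\(([^(),]+),")   # the regex from the comment above, precompiled

def with_regex():
    return rx.search(s).group(1)

def with_slicing():
    return s[s.rfind("(") + 1:s.find(")")].split(",")[0]

# Both approaches should agree before their speed is compared.
assert with_regex() == with_slicing()

for fn in (with_regex, with_slicing):
    seconds = timeit.timeit(fn, number=100000)
    print("%s: %.3f s per 100000 calls" % (fn.__name__, seconds))
```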
