3

I am using Python 2.6.4.

I have a series of select statements in a text file and I need to extract the field names from each select query. This would be easy if some of the fields didn't use nested functions like to_char() etc.

Given select statement fields that could have several levels of nested parentheses, like "ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name," or the simple case of just "base_field_name" as a field, is it possible to use Python's re module to write a regex to extract base_field_name? If so, what would the regex look like?

6 Answers

11

Regular expressions are not suitable for parsing "nested" structures. Try a full-fledged parsing kit such as pyparsing instead; examples of using pyparsing specifically to parse SQL can be found here and here. You'll no doubt need to treat those examples as a starting point and write some parsing code of your own, but it's definitely not too difficult.
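
To illustrate why nesting needs a counter rather than a flat pattern, here is a minimal sketch in plain Python (not pyparsing) that splits a select list on top-level commas only. The function name split_top_level is hypothetical, and the sketch assumes there are no quoted strings containing commas or parentheses.

```python
def split_top_level(select_list):
    """Split a select list on commas that are not inside parentheses."""
    fields = []
    depth = 0      # current parenthesis nesting level
    current = []   # characters of the field being collected
    for ch in select_list:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        if ch == ',' and depth == 0:
            # A comma at depth 0 separates two fields.
            fields.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = ''.join(current).strip()
    if tail:
        fields.append(tail)
    return fields


print(split_top_level(
    "ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name, other_field"))
```

The comma between base_field_name and format sits at depth 3, so it is kept inside its field, which is exactly the bookkeeping a classic regex cannot do.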


2 Comments

+1 for reminding the world that well-parenthesized expressions (well, all Chomsky Type-2 languages) need more than a regexp to be properly parsed :)
That's not correct: regexes can be suitable for this (Python's re module just doesn't support it yet). A pure PCRE regex matching nested paired parentheses would look like ^(?P<pn>\((?P<v>((?>[^()]+)|(?P>pn))*)\))$; it will match ((1+2)*3), and the matched group v would contain (1+2)*3.
2
>>> import re
>>> string = 'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name'
>>> rx = re.compile(r'^(.*?\()*(.+?)(,.*?)*(,|\).*?)*$')
>>> rx.search(string).group(2)
'base_field_name'
>>> rx.search('base_field_name').group(2)
'base_field_name'

2 Comments

PS: as Alex Martelli said, you should use a real parser here. If you only want a quick regex that just works, you can use this one, but you should really use a parser, as this regex looks rather ugly :)
I'm not after something that looks pretty, since it's a one-off to get me the data I want so I can do other things with it. :) But thanks, my regex is rusty and I figured someone might know better.
2

Use either a table-driven parser, as Alex Martelli suggests, or a hand-written recursive descent parser. They're not hard to write, and quite rewarding.
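
As a rough idea of what a hand-written recursive descent parser could look like for this problem, here is a minimal sketch. The grammar, the token rules, and all names (TOKEN_RX, tokenize, Parser, base_field_name) are assumptions chosen for illustration; real SQL expressions (operators, casts, escaped quotes) would need more rules.

```python
import re

# One token per match: a name, a quoted string, a number, or one of ( ) ,
TOKEN_RX = re.compile(r"\s*(?:([A-Za-z_][\w.]*)|('[^']*')|(\d+)|([(),]))")


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RX.match(text, pos)
        if match is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        tokens.append(match.group(match.lastindex))
        pos = match.end()
    return tokens


class Parser(object):
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError("expected %r, got %r" % (expected, token))
        self.pos += 1
        return token

    def expr(self):
        """expr := ATOM | NAME '(' [expr (',' expr)*] ')'"""
        name = self.take()
        if name in ('(', ')', ','):
            raise ValueError("unexpected %r" % name)
        if self.peek() != '(':
            return name                    # leaf: field name or literal
        self.take('(')
        args = []
        if self.peek() != ')':
            args.append(self.expr())       # recursion handles the nesting
            while self.peek() == ',':
                self.take(',')
                args.append(self.expr())
        self.take(')')
        return (name, args)                # call node: (function, arguments)


def base_field_name(field):
    # Any alias or trailing comma after the expression is simply left unread.
    tree = Parser(tokenize(field)).expr()
    while isinstance(tree, tuple):
        name, args = tree
        if not args:
            return None                    # call without arguments
        tree = args[0]                     # assume the field is the first argument
    return tree


print(base_field_name(
    'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name,'))
```

Following the first argument at every level is a guess based on the example in the question; with the parse tree in hand, a different rule per function is easy to substitute.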

Comments

1

This may be good enough:

import re
print(re.match(r".*\(([^\)]+)\)", "ltrim(to_char(field_name, format))").group(1))

You would need to do further processing: for example, capture the function name as well, then pull the field name out of the argument list according to that function's signature.

.*\b(\w+)\(([^\)]+)\)
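
A sketch of that further processing, under stated assumptions: the table FIELD_ARG_INDEX and the function my_func are hypothetical, and the lookup only inspects the innermost call (the first one whose parentheses contain no other parentheses).

```python
import re

# Hypothetical table: 0-based position of the field argument for each function.
# Functions not listed are assumed to take the field as their first argument.
FIELD_ARG_INDEX = {'to_char': 0, 'ltrim': 0, 'rtrim': 0, 'my_func': 1}

# A name followed by parentheses that contain no further parentheses.
INNERMOST_CALL = re.compile(r'(\w+)\(([^()]*)\)')


def field_from_innermost_call(expr):
    match = INNERMOST_CALL.search(expr)
    if match is None:
        return expr.strip()               # no call at all: treat as a bare field
    func = match.group(1).lower()
    args = [arg.strip() for arg in match.group(2).split(',')]
    index = FIELD_ARG_INDEX.get(func, 0)
    return args[index] if index < len(args) else None


print(field_from_innermost_call('ltrim(to_char(field_name, format))'))
print(field_from_innermost_call('upper(my_func(format, field_name))'))
```

This addresses the objection that not every function takes the field in the same position, at the cost of maintaining the table by hand.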

2 Comments

this prints 'field_name, format', not 'field_name' for me, and also doesn't work for the simple string 'field_name'.
How do you know every function is going to accept the same arguments?
1

Here's a really hacky parser that does what you want.

It works by calling 'eval' on the text to be parsed, mapping all identifiers to a function which returns its first argument (which I'm guessing is what you want given your example).

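class FakeFunction(object):
    """Stands in for any SQL function or field name; calling it returns its first argument."""
    def __init__(self, name):
        self.name = name
    def __call__(self, *args):
        return args[0]
    def __str__(self):
        return self.name

class FakeGlobals(dict):
    """Namespace in which every name lookup yields a FakeFunction of that name."""
    def __getitem__(self, x):
        return FakeFunction(x)

def ExtractBaseFieldName(x):
    # Warning: eval executes the text, so only use this on input you trust.
    return eval(x, FakeGlobals())

print(ExtractBaseFieldName('ltrim(rtrim(to_char(base_field_name, format)))'))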
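
The same idea can be sketched without executing anything, by letting the standard ast module parse the text and then walking the call nodes. This is an alternative to the eval approach, not part of it; it only works when the field expression happens to be valid Python expression syntax, so an alias after the expression makes it give up. The name extract_base_field_name is made up here.

```python
import ast


def extract_base_field_name(expr):
    """Follow the first argument of each nested call down to a plain name."""
    try:
        node = ast.parse(expr.strip(), mode='eval').body
    except SyntaxError:
        return None                     # not parseable as a Python expression
    while isinstance(node, ast.Call) and node.args:
        node = node.args[0]             # descend into the first argument
    if isinstance(node, ast.Name):
        return node.id
    return None


print(extract_base_field_name('ltrim(rtrim(to_char(base_field_name, format)))'))
```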

Comments

0

Do you really need regular expressions? To get the one you've got up there I'd use

  s[s.rfind('(')+1:s.find(')')].split(',')[0]

with 's' containing the original string.

Of course, it's not a general solution, but...
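
A slightly more defensive variant of the same slicing idea, as a sketch: it also copes with a bare field (with or without an alias) and with empty input. The function name first_innermost_arg is hypothetical, and it still assumes the field is the first argument of the last-opened call.

```python
def first_innermost_arg(s):
    start = s.rfind('(') + 1        # 0 when there is no '('
    end = s.find(')', start)        # first ')' after the last '('
    if end == -1:
        end = len(s)                # no parentheses: use the whole string
    # Keep the first comma-separated piece, then drop any trailing alias.
    parts = s[start:end].split(',')[0].split()
    return parts[0] if parts else None


print(first_innermost_arg(
    'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name,'))
print(first_innermost_arg('base_field_name renamed_field_name,'))
```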

3 Comments

A compiled regex should be much faster than this. Well, I guess we're not in a hurry, but still, just for the sake of efficiency.
You may find that working directly with strings is faster. It depends heavily on the regex and on the complexity of the equivalent code you need to write to do the same thing without a regex. Actually, have you tried comparing the two?
Oh, and in case I'd want to go for the equivalent regexp, I'd use "\(([^(),]+),", which is slightly faster than the purely string-based one. Both of them are one order of magnitude faster than your regexp...
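
For anyone who wants to check these timing claims on their own machine, a comparison could be set up roughly like this with the standard timeit module. The function names and the iteration count are arbitrary choices, and no particular result is implied.

```python
import re
import timeit

S = 'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name'
SHORT_RX = re.compile(r'\(([^(),]+),')
LONG_RX = re.compile(r'^(.*?\()*(.+?)(,.*?)*(,|\).*?)*$')


def by_slicing(s=S):
    return s[s.rfind('(') + 1:s.find(')')].split(',')[0]


def by_short_regex(s=S):
    return SHORT_RX.search(s).group(1)


def by_long_regex(s=S):
    return LONG_RX.search(s).group(2)


for func in (by_slicing, by_short_regex, by_long_regex):
    seconds = timeit.timeit(func, number=10000)
    print('%-15s %.4f s' % (func.__name__, seconds))
```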
