Python regex for select/extract from nested groups

Question

I am trying to process a string containing CHAR(int) and NCHAR(int) calls and replace each instance with its ASCII counterpart. An example would be something like this:

CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))

Note that I don't want to touch VARCHAR(int); only the CHAR(int) and NCHAR(int) parts should be converted. The above should translate to:

|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns] WHERE xtype=U AND id = OBJECT_ID(EN_Empl)

Note that any "+" on either side of CHAR(int) or NCHAR(int) should be removed. I tried the following:

def conv(m):
    return chr(int(m.group(2)))

print re.sub(r'([\+ ]?n?char\((.*?)\)[\+ ]?)', conv, str, re.IGNORECASE)

where str is the raw string that must be processed.

Somehow, the VARCHAR(8000) is being picked up. If I tweak the regex, the "=" after xtype is going away, rather than just the space and the "+" on either side of a CHAR(int) or NCHAR(int) instance.

Hope someone can pull me out of this.

ADDITIONAL SAMPLE STRINGS:

String "char(124)+(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"

Regex: r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

Result: "|(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(ENCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"

dawg · Accepted Answer · 2013-12-14 19:00:07Z

2

You have three issues:

You need to pass flags=re.IGNORECASE, not just re.IGNORECASE, to re.sub. The fourth positional parameter of re.sub is count, so the flag must be given as a keyword argument.
You need to use \b to find the word boundary.
You should not use str as a name, since you will overwrite the built-in of the same name.
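
To illustrate the first point, here is a minimal Python 3 sketch. The sample string s and the names wrong and right are made up for this illustration. Passing re.IGNORECASE positionally puts it in the count slot; since the flag's integer value is 2, the call behaves like a case-sensitive substitution limited to two replacements:

```python
import re

def conv(m):
    # group 1 holds the digits inside CHAR(...) / NCHAR(...)
    return chr(int(m.group(1)))

pat = r'\bn?char\((\d+)\)(?:\s*\+\s*)?'
s = 'nchar(69)+nchar(78)+nchar(95)+NCHAR(69)'  # hypothetical sample

# Signature: re.sub(pattern, repl, string, count=0, flags=0).
# re.IGNORECASE is an integer flag with value 2, so the positional form
# re.sub(pat, conv, s, re.IGNORECASE) is equivalent to this call:
wrong = re.sub(pat, conv, s, count=int(re.IGNORECASE))

# Keyword form: case-insensitive, no limit on the number of replacements.
right = re.sub(pat, conv, s, flags=re.IGNORECASE)

print(wrong)  # ENnchar(95)+NCHAR(69)
print(right)  # EN_E
```

In the first call only two lowercase instances are replaced and the uppercase NCHAR is never matched; the keyword form converts all four.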

This works:

import re

tgt='''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

pat=r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

def conv(m):
    return chr(int(m.group(2)))

print re.sub(pat, conv, tgt, flags=re.IGNORECASE)

More completely:

import re

tgt='''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

pat=r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

def conv(m):
    return chr(int(m.group(2)))

print re.sub(r'''
              (                                 # group 1
              \b                                # word boundary
              n?char                            # nchar or char
              \(                                # literal left paren
              (\s*\d+\s*)                       # digits surrounded by spaces
              \)                                # literal right paren
              (?:\s*\+\s*)?                     # optionally followed by a concating '+' 
              )                                 '''
            , conv, tgt, flags=re.VERBOSE | re.IGNORECASE)

Prints:

|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=U
AND id = OBJECT_ID(EN_Empl)

edited Dec 14, 2013 at 19:00

answered Dec 14, 2013 at 18:32

dawg


3 Comments

Web User Over a year ago

I used your expression and I think I am a step closer! However, if you look at the new sample I have provided in my question alongside the regex you provided, you will see that in a series of nchar(int) instances separated by "+", only the first instance is converted and the remaining instances are preserved as is. Any suggestions? NCHAR(69) was replaced with "E" but everything else stayed the same.

dawg Over a year ago

Did you use keyword re.sub(..., flags=re.IGNORECASE)?

Web User Over a year ago

Yes I used the IGNORECASE flag. @Tim Peters' answer worked for me. I think the difference between your answer and his is that he added the "+" and whitespace portion before and after, which was needed. Thanks for your help!!

Tim Peters · Accepted Answer · 2013-12-14 18:39:15Z

1

You can go a long way just by adding the word boundary (\b) assertion, but I'd like to suggest that you (1) use re.VERBOSE to write a regexp someone can understand later; (2) compile the regexp to reduce clutter at the call site; and, (3) tighten some of the matching criteria. Like so:

import re

def conv(m):
    return chr(int(m.group(1)))

pat = re.compile(r"""[+\s]*    # optional whitespace or +
                     \b        # word boundary
                     n?char    # NCHAR or CHAR
                     \(        # left paren
                     ([\d\s]+) # digits or spaces - group 1
                     \)        # right paren
                     [+\s]*    # optional whitespace or +
                  """, re.VERBOSE | re.IGNORECASE)
print pat.sub(conv, data)

Note that I changed your str to data: str is the name of a heavily used builtin function, and it's a Really Bad Idea to create a variable with the same name.
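
As a small illustration of why shadowing str is risky, here is a hedged Python 3 sketch. The function name shadow_demo and the sample text are invented for this example, and the shadowing is kept inside a function so it does not leak into the rest of a program:

```python
def shadow_demo():
    str = "raw SQL text"      # local name now hides the built-in str type
    try:
        return str(124)       # tries to call a string object, not the type
    except TypeError as exc:
        return "TypeError: %s" % exc

print(shadow_demo())
```

Once the name is rebound, any later code in that scope that expects str to be the built-in type fails in this way.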

answered Dec 14, 2013 at 18:39

Tim Peters


11 Comments

Web User Over a year ago

Thanks @Tim Peters. Appreciate the suggestion to improve readability; it helps me too (let alone someone else!). I did try it out and it appears to work! One thing I don't understand is how the grouping+search+replace works in your regex. I was originally creating one group (for the value to be converted to its ASCII equivalent) within another group (that encapsulated the "+" and whitespace around the [N]CHAR(int) instances). Your regex removes any surrounding "+" or whitespace even though it is not part of the group. I have to spend more time with regex fundamentals. Thanks for your help!

Tim Peters Over a year ago

You're welcome :-) sub() replaces the entire substring matched by the regexp, so there was really no need for the outermost group. That's why I removed it. We do still need a group to isolate the digits, though, so that conv() can find them easily. But the output of conv() replaces the entire substring matched by the regexp. Maybe a little subtle at first, but you'll get used to it quickly ;-)
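
The point about sub() replacing the whole match can be checked with a short Python 3 sketch. The list seen and the sample strings are assumptions added for illustration; the pattern is the one from the answer written on a single line:

```python
import re

pat = re.compile(r"[+\s]*\bn?char\(([\d\s]+)\)[+\s]*", re.IGNORECASE)

seen = []  # (whole match, group 1) pairs, recorded only for illustration

def conv(m):
    seen.append((m.group(0), m.group(1)))
    # Only group 1 is used to build the replacement text ...
    return chr(int(m.group(1)))

# ... but the replacement takes the place of the entire match, group(0).
result = pat.sub(conv, "xtype=char(85) + NCHAR(69)")
print(seen[:2])  # [('char(85) + ', '85'), ('NCHAR(69)', '69')]
print(result)    # xtype=UE

# Whitespace after a lone CHAR(int) is part of the match too:
print(pat.sub(conv, "xtype=char(85) AND id"))  # xtype=UAND id
```

The last line shows a side effect worth knowing about: because surrounding whitespace belongs to the match, a space that separates a lone CHAR(int) from the next keyword is removed as well.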

Web User Over a year ago

Thanks @Tim Peters for the very helpful explanation and answer!

Web User Over a year ago

I have another set of these strings containing something like _char(75) or _nchar(65), which the conversion skips. I understand this is because "_" counts as a word character, i.e. [a-zA-Z0-9_], so there is no word boundary between it and "char". How do I handle this using the above pattern, which works for the majority of cases?

Tim Peters Over a year ago

Add _? to the regexp after the \b line. And notice how easy it is to change a regexp when it's written in little pieces spread across multiple lines ;-)
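
A minimal Python 3 sketch of that change, with the pattern written on one line. The names base and with_underscore and the sample string data are made up for this illustration:

```python
import re

def conv(m):
    return chr(int(m.group(1)))

# Pattern from the answer, and the same pattern with _? after \b.
base = re.compile(r"[+\s]*\bn?char\(([\d\s]+)\)[+\s]*", re.IGNORECASE)
with_underscore = re.compile(r"[+\s]*\b_?n?char\(([\d\s]+)\)[+\s]*", re.IGNORECASE)

data = "_char(75)+_nchar(65)"  # hypothetical sample

print(base.sub(conv, data))             # _char(75)+_nchar(65)  (unchanged)
print(with_underscore.sub(conv, data))  # KA

# The underscore must still start a word for \b to match before it:
print(with_underscore.sub(conv, "foo_char(75)"))  # foo_char(75)  (unchanged)
```

Note that \b now has to match before the underscore, so the variant covers _char(75) at the start of the string or after a non-word character such as "+", but not an underscore in the middle of an identifier.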


Casimir et Hippolyte · Accepted Answer · 2013-12-14 18:26:21Z

0

You only need to use a word boundary \b:

import re

def conv(m):
    return chr(int(m.group(1)))

print re.sub(r'\bn?char\(([^)]+)\)(?:\s*\+\s*)?', conv, str, flags=re.IGNORECASE)

answered Dec 14, 2013 at 18:26

Casimir et Hippolyte
