I am trying to process a string that contains CHAR(int) and NCHAR(int) calls and replace each one with the ASCII character it stands for. An example would be something like this:

CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))

Note that I don't want to touch VARCHAR(int); only the CHAR(int) and NCHAR(int) parts should be converted. The above should translate to:

|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns] WHERE xtype=U AND id = OBJECT_ID(EN_Empl)

Note that any "+" on either side of CHAR(int) or NCHAR(int) should be removed. I tried the following:

def conv(m):
    return chr(int(m.group(2)))

print re.sub(r'([\+ ]?n?char\((.*?)\)[\+ ]?)', conv, str, re.IGNORECASE)

where str=the raw string that must be processed.

Somehow, the VARCHAR(8000) is being picked up. If I tweak the regex, the "=" after xtype is going away, rather than just the space and the "+" on either side of a CHAR(int) or NCHAR(int) instance.

Hope someone can pull me out of this.

ADDITIONAL SAMPLE STRINGS:

String "char(124)+(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"

Regex: r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

Result: "|(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(ENCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"

3 Answers

2

You have three issues:

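  1. You need to pass flags=re.IGNORECASE and not just re.IGNORECASE to re.sub. flags is a keyword argument; the fourth positional argument is count, so a bare re.IGNORECASE is taken as a limit on the number of replacements.
  2. You need to use \b to find the word boundary, so that the char inside VARCHAR is not matched.
  3. You should not use str as a name, since you will overwrite the built-in of the same name.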
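The first point is easy to miss because the call raises no error. Here is a minimal sketch of the effect, using a shortened, made-up sample string rather than the asker's exact input. It relies on two facts: the signature is re.sub(pattern, repl, string, count=0, flags=0), and re.IGNORECASE has the integer value 2.

```python
import re

pat = r'\bn?char\((\d+)\)(?:\s*\+\s*)?'

def conv(m):
    return chr(int(m.group(1)))

# Hypothetical, shortened sample for illustration only.
sample = "char(124)+cast(x as varchar(8000)) = NCHAR(69)+NCHAR(78)+NCHAR(95)"

# re.sub(pat, conv, sample, re.IGNORECASE) binds the flag to `count`.
# Spelled out explicitly, that call is equivalent to this one:
as_count = re.sub(pat, conv, sample, count=int(re.IGNORECASE))

# Passing the flag by keyword makes the match case-insensitive.
as_flag = re.sub(pat, conv, sample, flags=re.IGNORECASE)

# Case-insensitive matching combined with an accidental count of 2.
both = re.sub(pat, conv, sample, count=2, flags=re.IGNORECASE)

print(as_count)  # |cast(x as varchar(8000)) = NCHAR(69)+NCHAR(78)+NCHAR(95)
print(as_flag)   # |cast(x as varchar(8000)) = EN_
print(both)      # |cast(x as varchar(8000)) = ENCHAR(78)+NCHAR(95)
```

The last line has the same shape as the "Result" in the question's additional sample: the first two matches are replaced and the rest are left alone. Whether that is what actually happened there is only a guess, since the question does not show the call that produced it.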

This works:

import re

tgt='''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

pat=r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

def conv(m):
    return chr(int(m.group(2)))

print(re.sub(pat, conv, tgt, flags=re.IGNORECASE))

More completely:

import re

tgt='''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

def conv(m):
    return chr(int(m.group(2)))

print(re.sub(r'''
              (                                 # group 1
              \b                                # word boundary
              n?char                            # nchar or char
              \(                                # literal left paren
              (\s*\d+\s*)                       # group 2: digits, optionally padded with spaces
              \)                                # literal right paren
              (?:\s*\+\s*)?                     # optionally followed by a concatenating '+'
              )                                 '''
            , conv, tgt, flags=re.VERBOSE | re.IGNORECASE))

Prints:

|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=U
AND id = OBJECT_ID(EN_Empl)

3 Comments

I used your expression and I think I am a step closer! However, if you look at the new sample I have provided in my question alongside the regex you provided, you will see that in a series of nchar(int) instances separated by "+", only the first instance is converted and the remaining instances are preserved as is. Any suggestions? NCHAR(69) was replaced with "E" but everything else stayed the same.
Did you use keyword re.sub(..., flags=re.IGNORECASE)?
Yes I used the IGNORECASE flag. @Tim Peters' answer worked for me. I think the difference between your answer and his is that he added the "+" and whitespace portion before and after, which was needed. Thanks for your help!!
1

You can go a long way just by adding the word boundary (\b) assertion, but I'd like to suggest that you (1) use re.VERBOSE to write a regexp someone can understand later; (2) compile the regexp to reduce clutter at the call site; and, (3) tighten some of the matching criteria. Like so:

import re

def conv(m):
    return chr(int(m.group(1)))

pat = re.compile(r"""[+\s]*    # optional whitespace or +
                     \b        # word boundary
                     n?char    # NCHAR or CHAR
                     \(        # left paren
                     ([\d\s]+) # digits or spaces - group 1
                     \)        # right paren
                     [+\s]*    # optional whitespace or +
                  """, re.VERBOSE | re.IGNORECASE)
print(pat.sub(conv, data))  # data holds the raw string to process

Note that I changed your str to data: str is the name of a heavily used builtin function, and it's a Really Bad Idea to create a variable with the same name.
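For reference, here is a self-contained run of the pattern above on the second sample string from the question. The variable names result, multi and joined are introduced only for this illustration.

```python
import re

def conv(m):
    return chr(int(m.group(1)))

pat = re.compile(r"""[+\s]*    # optional whitespace or +
                     \b        # word boundary
                     n?char    # NCHAR or CHAR
                     \(        # left paren
                     ([\d\s]+) # digits or spaces - group 1
                     \)        # right paren
                     [+\s]*    # optional whitespace or +
                  """, re.VERBOSE | re.IGNORECASE)

# The single-line sample from the "ADDITIONAL SAMPLE STRINGS" part of the question.
data = ("char(124)+(Select Top 1 cast(name as varchar(8000)) from "
        "(Select Top 1 colid,name From [Projects]..[syscolumns] Where id = "
        "OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))")

result = pat.sub(conv, data)
print(result)

# Made-up multi-line fragment: the trailing [+\s]* also consumes the newline.
multi = "WHERE xtype=char(85)\nAND id = 1"
joined = pat.sub(conv, multi)
print(joined)  # WHERE xtype=UAND id = 1
```

All seven NCHAR calls collapse into EN_Empl and varchar(8000) is left alone. One thing to be aware of: [+\s]* matches plain whitespace as well as "+", so whitespace and line breaks directly next to a converted call are removed too, as the second print shows.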

11 Comments

Thanks @Tim Peters. Appreciate the suggestion to improve readability; it helps me too (let alone someone else!). I did try it out and it appears to work! One thing I don't understand is how the grouping+search+replace works in your regex. I was originally creating one group (for the value to be converted to its ASCII equivalent) within another group (that encapsulated the "+" and whitespace around the [N]CHAR(int) instances). Your regex removes any surrounding "+" or whitespace even though it is not part of the group. I need to spend more time with regex fundamentals. Thanks for your help!
You're welcome :-) sub() replaces the entire substring matched by the regexp, so there was really no need for the outermost group. That's why I removed it. We do still need a group to isolate the digits, though, so that conv() can find them easily. But the output of conv() replaces the entire substring matched by the regexp. Maybe a little subtle at first, but you'll get used to it quickly ;-)
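A small sketch of that point, with a made-up input: the callback records group(0), which is the whole substring that sub() replaces, next to group(1), which is the only part the callback reads. The seen list is there purely for illustration.

```python
import re

seen = []

def conv(m):
    # Record the whole match and the captured digits to compare them.
    seen.append((m.group(0), m.group(1)))
    return chr(int(m.group(1)))

pat = re.compile(r"[+\s]*\bn?char\(([\d\s]+)\)[+\s]*", re.IGNORECASE)

# Hypothetical input for illustration.
result = pat.sub(conv, "x = NCHAR(69) + NCHAR(78)")

print(seen)    # [(' NCHAR(69) + ', '69'), ('NCHAR(78)', '78')]
print(result)  # x =EN
```

The surrounding " + " and spaces sit outside group 1 but inside the overall match, so they disappear along with the NCHAR(...) text.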
Thanks @Tim Peters for the very helpful explanation and answer!
I have another instance of these strings which contains something like _char(75) or _nchar(65), and the conversion skips those. I understand this is because "_" counts as a word character, i.e. [a-zA-Z0-9_], so there is no word boundary between the "_" and char. How do I handle this using the above pattern, which works for the majority of cases?
Add _? to the regexp after the \b line. And notice how easy it is to change a regexp when it's written in little pieces spread across multiple lines ;-)
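Here is a sketch of that change applied to the verbose pattern, with made-up inputs. The only addition is the _? line; the underscore becomes part of the match, so it is dropped from the output together with the call.

```python
import re

def conv(m):
    return chr(int(m.group(1)))

pat = re.compile(r"""[+\s]*    # optional whitespace or +
                     \b        # word boundary
                     _?        # optional leading underscore (the suggested addition)
                     n?char    # NCHAR or CHAR
                     \(        # left paren
                     ([\d\s]+) # digits or spaces - group 1
                     \)        # right paren
                     [+\s]*    # optional whitespace or +
                  """, re.VERBOSE | re.IGNORECASE)

# Hypothetical inputs for illustration.
result = pat.sub(conv, "id=_char(75)+_nchar(65)+char(66)")
untouched = pat.sub(conv, "foo_char(75) varchar(8000)")

print(result)     # id=KAB
print(untouched)  # foo_char(75) varchar(8000)
```

The second input shows what still does not match: an underscore that follows a letter has no word boundary in front of it, and varchar is excluded as before.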
0

You only need to use a word boundary \b:

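import re

def conv(m):
    return chr(int(m.group(1)))

# data is the raw string; flags must be passed by keyword (the 4th positional argument is count)
print(re.sub(r'\bn?char\(([^)]+)\)(?:\s*\+\s*)?', conv, data, flags=re.IGNORECASE))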
