2

I'm trying to find and print the beginning and ending indices of the C keywords inside this string:

code = 'int main( void )\n{\nreturn 0;\n}'

Here's what I have so far:

pattern = '/\bint|void|return\b/'
temp = re.compile( pattern )
for result in temp.finditer( code ):
    print 'Found %s from %d to %d.' % ( result.group(), result.start(), result.end() )

However, only 'void' is being found. Why is that?

If it's always going to be int <something> return <something>; why not simply match the string? code.find('int ') ... Commented May 7, 2015 at 3:01

3 Answers

2

Here is an example:

src='''\
int main( void )
   {
      return 0;
   }
'''

import re

for key, span in ((m.group(1), m.span(1)) for m in re.finditer(r'\b(int|main|void|return)\b', src)):
    print key, span

Prints:

int (0, 3)
main (4, 8)
void (10, 14)
return (28, 34)

But I think using a set of keywords to validate the words found is better than listing every word in the pattern.

Consider:

keywords={'int', 'main', 'void', 'return'}

for key, span in ((m.group(1), m.span(1)) for m in re.finditer(r'\b(\w+)\b', src) 
                                                          if m.group(1) in keywords):
    print key, span

Same output, but easier to add words.
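A minimal sketch of the same idea wrapped in a reusable function, run against the string from the question. The name find_keywords and the (word, start, end) tuple layout are illustrative assumptions, not part of the answer above:

```python
import re

def find_keywords(src, keywords):
    """Return (word, start, end) for each whole word of src found in keywords."""
    hits = []
    # \b\w+\b matches one whole word at a time; membership in the set
    # decides whether that word counts as a keyword.
    for m in re.finditer(r'\b\w+\b', src):
        if m.group() in keywords:
            hits.append((m.group(), m.start(), m.end()))
    return hits

code = 'int main( void )\n{\nreturn 0;\n}'
for word, start, end in find_keywords(code, {'int', 'void', 'return'}):
    print('Found %s from %d to %d.' % (word, start, end))
```

Because whole words are compared against the set, identifiers such as integer or avoid are not reported.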


1 Comment

You still want to keep the word-boundary conditions.

2
pattern = '/\bint|void|return\b/' # wrong

1. Python doesn't enclose patterns in /:

pattern = '\bint|void|return\b' # still wrong

2. You really want to make this a raw string; otherwise \b is interpreted as the backspace control character rather than a word boundary:

pattern = r'\bint|void|return\b' # still wrong

3. You need to enclose your or-group in parentheses so that both \b anchors apply to every alternative:

pattern = r'\b(int|void|return)\b' # yay

And then:

re.compile(pattern).findall(code)
# ['int', 'void', 'return']
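Since the question asks for the start and end indices rather than just the matched text, here is a short sketch that uses finditer with the corrected pattern. The variable name spans is only illustrative:

```python
import re

code = 'int main( void )\n{\nreturn 0;\n}'
pattern = re.compile(r'\b(int|void|return)\b')

# Collect (keyword, start, end) for every match; end is exclusive, as in slicing.
spans = [(m.group(), m.start(), m.end()) for m in pattern.finditer(code)]

for word, start, end in spans:
    print('Found %s from %d to %d.' % (word, start, end))
```

This should print "Found int from 0 to 3.", "Found void from 10 to 14." and "Found return from 19 to 25."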

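In your original pattern, alternation (|) has the lowest precedence, so the entire thing was interpreted as three separate alternatives:
/\bint, void, and return\b/. The first and last also required literal slashes and (without the raw string) backspace characters, so only void could match.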
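To see why the parentheses matter even once the slashes are gone and the string is raw, here is a small sketch on a made-up sample string that contains no standalone keywords. It uses a non-capturing group, (?:...), so that findall returns the whole match:

```python
import re

text = 'avoid integer noreturn'  # made-up sample: keywords appear only inside other words

# Without a group, \b binds only to the first and last alternatives.
ungrouped = re.findall(r'\bint|void|return\b', text)

# With a group, both \b anchors apply to every alternative.
grouped = re.findall(r'\b(?:int|void|return)\b', text)

print(ungrouped)
print(grouped)
```

The ungrouped pattern reports void, int and return from inside the longer words, while the grouped pattern finds nothing.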

2

First of all, Python doesn't use forward slashes (/) to mark the start and end of a regular expression pattern. By convention, raw strings are used instead. Raw strings turn off escape-sequence processing in string literals. The most common example is the newline escape ('\n'). Normally these two characters are converted into the single newline character, but if we want a literal backslash followed by a literal n, we use a raw string like r'\n'. Alternatively, we could escape the backslash and write '\\n', but in a longer string with more special characters we really want to avoid doubling backslashes everywhere. That is why raw strings are so convenient for writing regular expressions.

You forgot to make your pattern a raw string, so each \b is interpreted as a string escape sequence (the backspace character, ASCII 8) instead of a regex word boundary. You can make any string literal a raw string by prefixing it with r:

>>> re.findall('\bint|void|return\b', 'int main( void )\n{\nreturn 0;\n}')
['void']
>>> re.findall(r'\bint|void|return\b', 'int main( void )\n{\nreturn 0;\n}')
['int', 'void', 'return']
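A quick way to check what each literal actually contains is sketched below. The variable names plain and raw are just for illustration:

```python
import re

plain = '\b'   # escape sequence: a single backspace character (ASCII 8)
raw = r'\b'    # raw string: a backslash followed by the letter b

print('%d %d' % (len(plain), ord(plain)))
print('%d' % len(raw))

# Only the raw version reaches the regex engine as a word-boundary assertion.
print(re.search(raw + 'int', 'int main') is not None)
print(re.search(plain + 'int', 'int main') is not None)
```

The plain literal only matches where the text contains an actual backspace character, which the sample code string never does.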

2 Comments

Thanks. I added the r prefix, but I'm still getting the same results.
@FernandoKarpinski Ah sorry, I missed the two forward slashes in your string. I have edited my answer.
