2

How would I write a regex that removes all comments that start with # and run to the end of the line, while leaving the first two lines untouched? Those two lines are

#!/usr/bin/python 

and

#-*- coding: utf-8 -*-
  • Comments don't slow your code down. Why do you want to remove them? Commented Aug 11, 2011 at 20:03
  • You don't :). At least, not with a simple regex. Consider the following: s = 'not # a # comment!', or this: s = """ \n foo # \n bar """ (where \n are actual line breaks) Commented Aug 11, 2011 at 20:06
  • @agf, to make things more difficult for the next person to work on the code! Commented Aug 11, 2011 at 20:06
  • This question is similar to stackoverflow.com/q/1621521, where there is already a (not entirely regex) solution that may satisfy your needs. Commented Aug 11, 2011 at 20:13

3 Answers

5

You can remove comments by parsing the Python code with tokenize.generate_tokens. The following is a slightly modified version of this example from the docs:

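import io
import sys
import tokenize

# Python 3 reads the source as text (str); Python 2 reads it as bytes.
if sys.version_info[0] == 3:
    StringIO = io.StringIO
else:
    StringIO = io.BytesIO

def nocomment(s):
    """Return the source string s with all comment tokens removed."""
    result = []
    g = tokenize.generate_tokens(StringIO(s).readline)
    for toknum, tokval, _, _, _ in g:
        # Keep every token except comments.
        if toknum != tokenize.COMMENT:
            result.append((toknum, tokval))
    # Given only (type, string) pairs, untokenize does not keep the original spacing.
    return tokenize.untokenize(result)

with open('script.py', 'r') as f:
    content = f.read()

print(nocomment(content))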

For example:

If script.py contains

def foo(): # Remove this comment
    ''' But do not remove this #1 docstring 
    '''
    # Another comment
    pass

then the output of nocomment is

def foo ():
    ''' But do not remove this #1 docstring 
    '''

    pass 
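
The version above also strips the #! and coding lines, and it changes spacing (note def foo ():). A possible variant that addresses both, as a minimal sketch for Python 3 only: it uses the token positions reported by tokenize to cut each comment out of the original text instead of calling untokenize. The name strip_comments and the keep_first parameter are made up for this illustration and are not part of the tokenize module.

```python
import io
import tokenize


def strip_comments(source, keep_first=2):
    """Remove '#' comments from source, leaving the first keep_first lines untouched.

    strip_comments and keep_first are hypothetical names used for this sketch.
    """
    # Split lines the same way tokenize will see them, so row numbers agree.
    lines = io.StringIO(source).readlines()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        row, col = tok.start  # row is 1-based, col is 0-based
        if tok.type != tokenize.COMMENT or row <= keep_first:
            continue
        line = lines[row - 1]
        # A comment always runs to the end of its line, so cut at its start
        # column and put the original line ending back.
        line_ending = line[len(line.rstrip("\r\n")):]
        lines[row - 1] = line[:col].rstrip() + line_ending
    return "".join(lines)


if __name__ == "__main__":
    sample = (
        "#!/usr/bin/python\n"
        "#-*- coding: utf-8 -*-\n"
        "def foo():  # Remove this comment\n"
        "    ''' But do not remove this #1 docstring\n"
        "    '''\n"
        "    # Another comment\n"
        "    pass\n"
    )
    print(strip_comments(sample))
```

Lines that held only a comment are left behind as empty lines, which keeps line numbers stable; dropping them would be a small extra step.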

5 Comments

I'm just curious: How well does this handle stuff like extra whitespace?
@PiPeep: For an example of how tokenize can handle whitespace, see reindent.py.
I think your code needs updating. It now raises an error inside the library itself: File "/usr/lib/python3.6/tokenize.py", line 565, in _tokenize if line[pos] in '#\r\n': # skip comments or blank lines TypeError: 'in <string>' requires string as left operand, not int
Your code works in Python 2.7, but not in Python 3.6.
Updated for Python 3.
1

I don't actually think this can be done purely with a regular expression, as you'd need to count quotes to ensure that an instance of # isn't inside a string.

I'd look into Python's built-in code parsing modules for help with something like this.
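
To illustrate the quote problem, here is a small sketch of what happens with the obvious pattern. NAIVE_COMMENT, naive_strip and compiles are names invented for this example; the pattern is simply "from # to the end of the line".

```python
import re

# Hypothetical naive pattern: delete everything from '#' to the end of the line.
NAIVE_COMMENT = re.compile(r"#.*")


def naive_strip(source):
    """Apply the naive pattern, with no awareness of string literals."""
    return NAIVE_COMMENT.sub("", source)


def compiles(source):
    """Return True if source is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


code = "s = 'not # a # comment!'  # real comment\n"
broken = naive_strip(code)

print(repr(broken))      # the string literal has been cut in half
print(compiles(code))    # True
print(compiles(broken))  # False
```

The regex has no way of knowing that the first # sits inside a string literal, so it truncates the literal and the result no longer compiles.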


1
sed -e '1,2p' -e '/^\s*#/d' infile

Then wrap this in a subprocess.Popen call.

However, this is no substitute for a real parser! Why does that matter? Well, consider this Python script:

output = """
This is
#1 of 100"""

Boom, any non-parsing solution instantly breaks your script.

1 Comment

Why not just use Python's re module in the example, rather than requiring a platform-dependent tool?
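
For what it's worth, a rough pure-Python counterpart of the sed approach might look like the sketch below. drop_comment_lines and keep_first are names invented here. Like the sed command, it only drops lines that consist entirely of a comment, and it is just as blind to multi-line strings.

```python
import re

# A line whose first non-blank character is '#'.
FULL_LINE_COMMENT = re.compile(r"^\s*#")


def drop_comment_lines(source, keep_first=2):
    """Drop full-line comments, always keeping the first keep_first lines.

    drop_comment_lines and keep_first are hypothetical names for this sketch.
    """
    lines = source.splitlines(keepends=True)
    kept = [
        line
        for number, line in enumerate(lines, start=1)
        if number <= keep_first or not FULL_LINE_COMMENT.match(line)
    ]
    return "".join(kept)


script = (
    "#!/usr/bin/python\n"
    "#-*- coding: utf-8 -*-\n"
    "# this line is dropped\n"
    "x = 1  # trailing comments are left alone\n"
)
print(drop_comment_lines(script))

# The failure case from the answer: the '#1 of 100' line is part of a string,
# but it is removed anyway, leaving an unterminated string literal.
fragile = 'output = """\nThis is\n#1 of 100"""\n'
print(repr(drop_comment_lines(fragile, keep_first=0)))
```

This avoids the dependency on sed, but it does not avoid the parsing problem the answer warns about.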
