python regex to remove comments

Question

How would I write a regex that removes all comments that start with # and run to the end of the line, while leaving alone the first two lines, which say

#!/usr/bin/python

and

#-*- coding: utf-8 -*-

Comments don't slow your code down. Why do you want to remove them?
– agf, Commented Aug 11, 2011 at 20:03
You don't :). At least, not with a simple regex. Consider the following: s = 'not # a # comment!', or this: s = """ \n foo # \n bar """ (where \n are actual line breaks)
– Bart Kiers, Commented Aug 11, 2011 at 20:06
@agf, to make things more difficult for the next person to work on the code!
– bgw, Commented Aug 11, 2011 at 20:06
This question is similar to stackoverflow.com/q/1621521, where there is already a (not entirely regex) solution that may satisfy your needs
– bgw, Commented Aug 11, 2011 at 20:13

unutbu · Accepted Answer · 2019-02-11 14:02:17Z

5

You can remove comments by parsing the Python code with tokenize.generate_tokens. The following is a slightly modified version of an example from the tokenize documentation:

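import io
import sys
import tokenize

# generate_tokens needs a readline that yields native str lines:
# text on Python 3, bytes on Python 2.
if sys.version_info[0] == 3:
    StringIO = io.StringIO
else:
    StringIO = io.BytesIO

def nocomment(s):
    """Return the source s with every COMMENT token dropped."""
    result = []
    g = tokenize.generate_tokens(StringIO(s).readline)
    for toknum, tokval, _, _, _ in g:
        if toknum != tokenize.COMMENT:
            # Keep only (type, string); untokenize then rebuilds the spacing itself.
            result.append((toknum, tokval))
    return tokenize.untokenize(result)

with open('script.py', 'r') as f:
    content = f.read()

print(nocomment(content))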

For example:

If script.py contains

def foo(): # Remove this comment
    ''' But do not remove this #1 docstring 
    '''
    # Another comment
    pass

then the output of nocomment is

def foo ():
    ''' But do not remove this #1 docstring 
    '''

    pass
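
Note that nocomment also drops the shebang and coding lines that the question wants to keep. A minimal sketch of one way around that, assuming Python 3 only: use each comment token's position to cut it out of the original text instead of calling untokenize, which also leaves the rest of the spacing untouched. The name strip_comments_keep_header and the keep_lines parameter are made up for this illustration.

```python
import io
import tokenize

def strip_comments_keep_header(source, keep_lines=2):
    """Remove '#' comments, except those on the first `keep_lines` lines.

    Hypothetical helper for illustration; assumes `source` is a str
    containing valid Python 3 code.
    """
    # Split with the same line convention that tokenize's readline sees.
    lines = io.StringIO(source).readlines()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        row, col = tok.start  # row is 1-based, col is 0-based
        if tok.type == tokenize.COMMENT and row > keep_lines:
            line = lines[row - 1]
            # A comment always runs to the end of its line, so keep the
            # text before it plus the original line ending.
            ending = line[len(line.rstrip("\r\n")):]
            lines[row - 1] = line[:col].rstrip() + ending
    return "".join(lines)
```

Comment-only lines become empty lines rather than disappearing, so line numbers in tracebacks still match the original file.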

edited Feb 11, 2019 at 14:02

answered Aug 11, 2011 at 20:37

unutbu



5 Comments

bgw Over a year ago

I'm just curious: How well does this handle stuff like extra whitespace?

unutbu Over a year ago

@PiPeep: For an example of how tokenize can handle whitespace, see reindent.py.

Nihal Over a year ago

I think your code needs updating; it now raises an error,

File "/usr/lib/python3.6/tokenize.py", line 565, in _tokenize
    if line[pos] in '#\r\n':           # skip comments or blank lines
TypeError: 'in <string>' requires string as left operand, not int

in the library itself

Nihal Over a year ago

Your code works in Python 2.7, but not in Python 3.6.

unutbu Over a year ago

Updated for Python3.

bgw · Accepted Answer · 2011-08-11 20:09:04Z

1

I don't actually think this can be done purely with a regular expression, as you'd need to count quotes to ensure that an instance of # isn't inside a string.

I'd look into Python's built-in code parsing modules for help with something like this.
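
To make the quoting problem concrete, here is a small sketch that applies the obvious pattern to the string example from the comments on the question. The helper names naive_strip and is_valid_python are invented for this demonstration; the pattern is only the naive approach, not a recommended one.

```python
import re

# The "obvious" pattern: everything from a '#' to the end of the line.
NAIVE_COMMENT = re.compile(r"#.*")

def naive_strip(source):
    """Delete every '#...' run, with no idea whether it sits inside a string."""
    return NAIVE_COMMENT.sub("", source)

def is_valid_python(source):
    """Return True if `source` compiles as a Python module."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

line = "s = 'not # a # comment!'"
stripped = naive_strip(line)

print(repr(stripped))             # the string literal has lost its closing quote
print(is_valid_python(line))      # True
print(is_valid_python(stripped))  # False
```

The regex cannot tell that the first # is inside a string literal, so the result no longer compiles.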

edited Aug 11, 2011 at 20:09

answered Aug 11, 2011 at 20:01

bgw


Boldewyn · Accepted Answer · 2011-08-11 20:12:18Z

1

sed -e '1,2p' -e '/^\s*#/d' infile

Then wrap this in a subprocess.Popen call.

However, this is no substitute for a real parser! Why does that matter? Well, assume this Python script:

output = """
This is
#1 of 100"""

Boom, any non-parsing solution instantly breaks your script.

edited Aug 11, 2011 at 20:12

answered Aug 11, 2011 at 20:02

Boldewyn


1 Comment

bgw Over a year ago

Why not just use the python re package in the example, rather than requiring a platform-dependent tool?
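
For what it's worth, a rough pure-Python counterpart of the sed one-liner could look like the sketch below. It assumes the intent is "keep the first two lines, delete every later line that contains only a comment"; drop_comment_lines and keep_lines are names chosen here, not anything from the answer. It has the same weakness the answer points out: a line such as #1 of 100 inside a triple-quoted string is deleted too, and trailing comments after code are left alone.

```python
import re

# Matches a line whose first non-blank character is '#'.
FULL_LINE_COMMENT = re.compile(r"^\s*#")

def drop_comment_lines(source, keep_lines=2):
    """Delete comment-only lines, leaving the first `keep_lines` lines as they are.

    Line-based and unaware of string literals, so it can damage
    multi-line strings that contain lines starting with '#'.
    """
    kept = []
    for number, line in enumerate(source.splitlines(keepends=True), start=1):
        if number > keep_lines and FULL_LINE_COMMENT.match(line):
            continue  # comment-only line outside the protected header
        kept.append(line)
    return "".join(kept)
```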
