Hi all, I've seen several solutions for extracting Python comments by using the tokenize module and looking for

tokenize.COMMENT

However, I can't find any example of how to do this for multiline comments. For example, consider the following readme.py file:

"""
multiline comment above class read me
read me too
"""
# dont read mee
class TestComment:
    """
        multiline inside class
    """
    def aFunc(self):
        pass

How can I extract the contents of multiline comments like the ones in the example above?

  • As far as a tokenizer would go, those are multiline string literals, not comments. Commented May 1, 2021 at 12:24
  • Docstrings are not treated as comments by the tokenize library, but if you leave out the closing """ it throws an exception. Interesting, they did handle that case:
        459 if contstr: # continued string
        460     if not line:
    --> 461         raise TokenError("EOF in multi-line string", strstart)
        462     endmatch = endprog.match(line)
        463     if endmatch:
    TokenError: ('EOF in multi-line string', (13, 0))
    Commented May 1, 2021 at 13:23

2 Answers

tokenize.COMMENT can be used for comments (those that begin with #), not for multiline string literals or docstrings.
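To make the distinction concrete, here is a minimal sketch that tokenizes the readme.py content from the question. The helper name tokens_of_type and the inline SOURCE string are only for illustration; they are not part of the tokenize module.

```python
import io
import tokenize

# The example file from the question, inlined so the sketch is self-contained.
SOURCE = '''"""
multiline comment above class read me
read me too
"""
# dont read mee
class TestComment:
    """
        multiline inside class
    """
    def aFunc(self):
        pass
'''


def tokens_of_type(source, token_type):
    """Return the text of every token of the given type in source."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == token_type]


comments = tokens_of_type(SOURCE, tokenize.COMMENT)
strings = tokens_of_type(SOURCE, tokenize.STRING)

print(comments)      # only the line starting with # is a COMMENT token
print(len(strings))  # the two triple-quoted blocks arrive as STRING tokens
```

So the triple-quoted blocks are visible to the tokenizer, just under tokenize.STRING rather than tokenize.COMMENT.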

However, you can use this regex in order to extract the multiline strings from your example:

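import re

file = "file.py"  # path of the Python file to scan

with open(file, "r") as f:
    content = f.read()

# Lazily capture everything between a pair of triple double quotes.
# re.DOTALL lets "." match newlines, so the match can span several lines.
p = re.compile('(?:""")(.*?)(?:""")', re.DOTALL)
result = p.findall(content)
print(result)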

Output:

['\nmultiline comment above class read me\nread me too\n', '\n        multiline inside class\n    ']

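If you want to keep the """, use capturing groups instead of non-capturing groups: (""") instead of (?:"""). Note that findall then returns a tuple of three groups for each match, so you may prefer finditer and m.group(0) to get the whole matched text.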

Using re.DOTALL is important, it allows the dot . to match any character including a newline.
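A small sketch of the difference, using a made-up text string rather than a real file. The last line shows one way to keep the surrounding quotes, by taking the whole match from finditer:

```python
import re

# A made-up snippet containing one triple-quoted block.
text = 'x = 1\n"""\nline one\nline two\n"""\n'

pattern = r'"""(.*?)"""'

# Without re.DOTALL the dot stops at newlines, so the block is not matched.
without_dotall = re.findall(pattern, text)

# With re.DOTALL the dot also matches newlines.
with_dotall = re.findall(pattern, text, re.DOTALL)

# group(0) is the whole match, including the opening and closing quotes.
whole = [m.group(0) for m in re.finditer(pattern, text, re.DOTALL)]

print(without_dotall)
print(with_dotall)
print(whole)
```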

A little warning:

Please note that, as @edusanketd said in a comment, this regex will also match triple quotes used inside regular strings or single-line comments. So this regex is not a panacea: if all your Python files are structured as in your example (""" is used ONLY for multiline strings), it will be fine, but if some files use """ for other purposes (such as triple quotes inside regular strings), there will be some false matches.

Example code showing the limits of this regex:

"""
multiline comment above class read me
read me too
"""
# dont read mee
class TestComment:
    """
        multiline inside class
    """
    def aFunc(self):
        pass
a_string = '"""THIS IS NOT A COMMENT"""'
# """dont read me too"""

Output:

['\nmultiline comment above class read me\nread me too\n', '\n        multiline inside class\n    ', 'THIS IS NOT A COMMENT', 'dont read me too']
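One possible way around these false matches is to let the tokenizer decide what is a string, and keep only the STRING tokens that are themselves triple-quoted. The sketch below is an assumption-laden illustration, not a complete solution: the function name triple_quoted_strings is made up, it only strips the r, b and u prefixes, and it does not try to handle f-strings.

```python
import io
import tokenize


def triple_quoted_strings(source):
    """Return the contents of every triple-quoted string token in source."""
    found = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type != tokenize.STRING:
            continue  # comments and other tokens are ignored
        # Drop a string prefix such as r, b or u before checking the quotes.
        body = tok.string.lstrip("rRbBuU")
        if body.startswith(('"""', "'''")):
            found.append(body[3:-3])  # strip the opening and closing quotes
    return found


# The "limits" example from above, inlined for the sketch.
SOURCE = '''"""
multiline comment above class read me
read me too
"""
# dont read mee
class TestComment:
    """
        multiline inside class
    """
    def aFunc(self):
        pass
a_string = '"""THIS IS NOT A COMMENT"""'
# """dont read me too"""
'''

result = triple_quoted_strings(SOURCE)
print(result)
```

Here the single-quoted string and the # comment are skipped, because the tokenizer reports them as a regular STRING token and a COMMENT token respectively.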

Some information about using multi-line strings as multi-line comments:

A tweet from Guido van Rossum:

https://twitter.com/gvanrossum/status/112670605505077248?lang=en

Python tip: You can use multi-line strings as multi-line comments. Unless used as docstrings, they generate no code! :-)

And here is an interesting post from Sean Gillies on this subject:

https://sgillies.net/2017/05/30/python-multi-line-comments-and-triple-quoted-strings.html


2 Comments

Triple quotes used inside regular strings or single line comments would also get counted this way. That is a major drawback though.
Yes I agree, this is not the panacea, that's why I mentioned that it could be used in his example. You are right to point it out, I'll edit my post to mention that too.

The so-called multiline comments in Python are really string literals. When such a string is the first statement in a module, class, or function, it is treated as a docstring, which can be retrieved using objname.__doc__. Here objname stands for the class or function the docstring belongs to.

For example,

class ex:
    """dufvdv"""
    def __init__(self):
        # single line cmt
        """ multiline inside function
        another line """
        pass

Using __doc__ here looks like this:

>>> ex.__doc__
'dufvdv'
>>> ex.__init__.__doc__
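' multiline inside function\n        another line '

(The raw __doc__ keeps the source indentation on Python versions before 3.13; inspect.getdoc() returns a cleaned-up version.)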

I assume you were expecting a built-in way to extract all multiline comments from a piece of code. To the best of my knowledge, there isn't one.

So, you can use the __doc__ accordingly.
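If you would rather not import the file just to read __doc__, the standard ast module can read docstrings from the source text. The following is a rough sketch: collect_docstrings is a hypothetical helper, keying the result by node name is an arbitrary choice (names can collide in real code), and it only finds real docstrings, not stray triple-quoted strings elsewhere.

```python
import ast

# The class from the example above, as source text.
SOURCE = '''
class ex:
    """dufvdv"""
    def __init__(self):
        # single line cmt
        """ multiline inside function
        another line """
        pass
'''


def collect_docstrings(source):
    """Map each module, class, or function name to its docstring, if any."""
    docs = {}
    documented = (ast.Module, ast.ClassDef,
                  ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, documented):
            doc = ast.get_docstring(node)  # cleans indentation by default
            if doc is not None:
                # Module nodes have no name attribute, so use a placeholder.
                docs[getattr(node, "name", "<module>")] = doc
    return docs


docs = collect_docstrings(SOURCE)
for name, doc in docs.items():
    print(name, repr(doc))
```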
