Remove comments from C-like source code

Question

I am working on removing comments from C-like source code. Here is my code in Python 2.7. Any advice on areas to improve (especially performance), or on functional bugs I have not discovered, would be great.

Problem statement

Given a file path as a string, remove all the comments in that file, then either print the result or save it to a new text file, as you prefer.

Cases to consider:

// comment
/*

    comment
    */
    foo(); // comment

Source code

code='''// comment
/*
    /* hello python */
    comment
    */
    foo(); // comment
'''

def remove_comment(content):
    index = 0
    comment_line_inside = False
    comment_block_level = 0
    result = []
    while index < len(content):
        if content[index] == '/' and index + 1 < len(content) and content[index+1] == '*':
            comment_block_level += 1
        elif content[index] == '/' and content[index-1] == '*':
            comment_block_level -= 1
        elif content[index] == '/' and index + 1 < len(content) and content[index + 1] == '/':
            comment_line_inside = True
        elif content[index] == '\n' and comment_line_inside == True:
            comment_line_inside = False
        elif not comment_line_inside and comment_block_level == 0:
            result.append(content[index])
        index += 1

    return ''.join(result)

if __name__ == "__main__":
    print remove_comment(code)

Given how C-style comments work, remove_comment(code) should return \n comment\n */\n foo(); and not \n foo();.
– 301_Moved_Permanently, Commented Nov 28, 2016 at 8:47
Iterating a string using an index does not "look Python". An alternative to using index-1 would be to keep a lastChar. You have three conditions starting …== '/' and… in a row. Is a /* in an in-line-comment (comment_line_inside) really intended to increase the comment level? You may find it easier to "skip the rest of the line" as soon as the start of an in-line-comment is recognised. Your code lacks docstrings and comments ("The problem" may be writing code not easily misunderstood: adding and maintaining comments instead of removing them might help.)
– greybeard, Commented Nov 28, 2016 at 8:48
@MathiasEttinger: while keeping empty lines for line numbering is a fine point, why should \n comment (or even \n */) show up?
– greybeard, Commented Nov 28, 2016 at 8:50
@greybeard because comments in C are not nested: the first */ closes the first /*.
– 301_Moved_Permanently, Commented Nov 28, 2016 at 8:57

301_Moved_Permanently · Accepted Answer · 2016-11-30 07:50:15Z

First off, as said in the comments, a C-style comment runs from the first /* to the first */, meaning you cannot nest comments:

/* This comment is /* a nested */ comment */

should be interpreted as the leftover text comment */, because the first */ already closed the comment.
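
As a minimal sketch of that rule (not part of the original answer), a non-greedy regular expression reproduces the "first */ wins" behaviour. The variable names text and remaining are only illustrative:

```python
import re

text = '/* This comment is /* a nested */ comment */'

# Non-greedy match: the comment ends at the FIRST */ after the opening /*.
# re.DOTALL lets '.' cross newlines, although this sample has none.
remaining = re.sub(r'/\*.*?\*/', '', text, flags=re.DOTALL)

# What is left is exactly the text after the first */.
assert remaining == ' comment */'
```

The inner /* is simply part of the comment body, so it never increases any nesting level.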

It is also more natural, in Python, to iterate over the elements of a collection rather than over their indices. This allows you to write for character in content:. And if you truly need the indices, you can use enumerate.
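
A small illustration of the two iteration styles, using a made-up sample string (the names characters and slash_positions are hypothetical):

```python
content = 'a/b//c'

# Direct iteration: each element is a one-character string, no index needed.
characters = [character for character in content]

# enumerate yields (index, character) pairs for the cases where the
# position really matters.
slash_positions = [index for index, character in enumerate(content)
                   if character == '/']
```

Here slash_positions holds the offsets of every '/' without a manual counter or any content[index] lookups.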

You can also use temporary variables to store characters that may indicate the beginning or the end of a comment without having to look at the character before or after the current one:

def remove_comments(content):
    block_comment = False
    line_comment = False
    probably_a_comment = False
    result = []
    for character in content:
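        # A '/' outside any comment may open one; a second '/' must fall through to the checks below
        if not line_comment and not block_comment and not probably_a_comment and character == '/':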
            probably_a_comment = True
            continue

        if block_comment and character == '*':
            probably_a_comment = True
            continue

        if line_comment and character == '\n':
            line_comment = False
            result.append('\n')
        elif block_comment and probably_a_comment and character == '/':
            block_comment = False
        elif not line_comment and not block_comment:
            if probably_a_comment:
                if character == '/':
                    line_comment = True
                elif character == '*':
                    block_comment = True
                else:
                    result.append('/')  # Append the / we skipped when flagging that it was probably a comment starting
                    result.append(character)
            else:
                result.append(character)
        probably_a_comment = False

    return ''.join(result)

You can also simplify the memory management a bit by using a generator instead of appending to a list:

def remove_comments(content):
    def gen_content():
        block_comment = False
        line_comment = False
        probably_a_comment = False
        for character in content:
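            # A '/' outside any comment may open one; a second '/' must fall through to the checks below
            if not line_comment and not block_comment and not probably_a_comment and character == '/':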
                probably_a_comment = True
                continue

            if block_comment and character == '*':
                probably_a_comment = True
                continue

            if line_comment and character == '\n':
                line_comment = False
                yield '\n'
            elif block_comment and probably_a_comment and character == '/':
                block_comment = False
            elif not line_comment and not block_comment:
                if probably_a_comment:
                    if character == '/':
                        line_comment = True
                    elif character == '*':
                        block_comment = True
                    else:
                        yield '/'
                        yield character
                else:
                    yield character
            probably_a_comment = False

    return ''.join(gen_content())

If you want to go crazy, you can also use a state machine approach to simplify the code: no more boolean flags and far fewer comparisons on average:

def source_code(char):
    if char == '/':
        return comment_begin, ''
    return source_code, char

def comment_begin(char):
    if char == '/':
        return inline_comment, ''
    if char == '*':
        return block_comment, ''
    return source_code, '/'+char

def inline_comment(char):
    if char == '\n':
        return source_code, char
    return inline_comment, ''

def block_comment(char):
    if char == '*':
        return end_block_comment, ''
    return block_comment, ''

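def end_block_comment(char):
    if char == '/':
        return source_code, ''
    if char == '*':
        # Still possibly the start of */ (handles endings such as **/)
        return end_block_comment, ''
    return block_comment, ''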

def remove_comments(content):
    def gen_content():
        parser = source_code
        for character in content:
            parser, text = parser(character)
            yield text

    return ''.join(gen_content())

But, all in all, this is far too complicated for the task at hand. You can get the same job done using a simple regular expression:

import re


COMMENTS = re.compile(r'''
    (//[^\n]*(?:\n|$))    # Everything between // and the end of the line/file
    |                     # or
    (/\*.*?\*/)           # Everything between /* and */
''', re.VERBOSE | re.DOTALL)  # DOTALL lets . match newlines, so multi-line block comments are matched too


def remove_comments(content):
    return COMMENTS.sub('\n', content)
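
A quick check of a pattern of this shape on multi-line input. This is only a sketch: the sample text is made up, and it assumes re.DOTALL is set, since without it . does not match the newlines inside a block comment:

```python
import re

# Assumption: re.DOTALL is set so '.' also matches newlines in block comments.
COMMENTS = re.compile(r'''
    (//[^\n]*(?:\n|$))    # a line comment, including its end of line
    |                     # or
    (/\*.*?\*/)           # a block comment, up to the first closing marker
''', re.VERBOSE | re.DOTALL)

sample = 'int a; // counter\n/* multi\n   line */\nfoo();\n'

# Every comment (and, for line comments, the newline it swallowed)
# is replaced by a single newline.
stripped = COMMENTS.sub('\n', sample)

assert stripped == 'int a; \n\n\nfoo();\n'
```

Note that a block comment in the middle of a line is also replaced by a newline with this substitution, so the line count of the output can differ from that of the input.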

Thanks Mathias, love your comments, but I am confused by this line: if block_comment and character == '*': probably_a_comment = True. I think block_comment means we are already inside a block comment, so why do you need to set probably_a_comment to True? I thought probably_a_comment meant we are not sure whether we are inside a comment, but if block_comment is True we already are?
– Lin Ma, Commented Nov 30, 2016 at 7:29
@LinMa You're right, block_comment being True means "inside a block comment". The catch is that probably_a_comment is used for either the beginning or the end of a comment. Here we're checking whether we are encountering the start of */.
– 301_Moved_Permanently, Commented Nov 30, 2016 at 7:33
@LinMa You have fewer conditions, but they are more complex. The advantage of my approach is that it does not use indices at all. But all in all, the two versions are close, and both are outperformed by the regexp.
– 301_Moved_Permanently, Commented Nov 30, 2016 at 7:40
@LinMa Since you talked about state machine, I added a version that is closer to it. But still, the regexp engine wins.
– 301_Moved_Permanently, Commented Nov 30, 2016 at 7:50
@LinMa I don't call it, I return the function, so that later on it will be stored in parser and called with the next character using parser(character).
– 301_Moved_Permanently, Commented Dec 1, 2016 at 8:42
