How can I parse a syntactically correct C file that contains a single function but uses undefined types? The file is automatically indented (4 spaces) using this service, with braces on their own line below each block keyword, i.e. something like

if ( condition1 )
{
    func1( int hi );
    unktype foo;
    do
    {
        if ( condition2 )
            goto LABEL_1;
    }
    while ( condition3 );
}
else
{
    float a = bar(baz, 0);
LABEL_1:
    int foobar = (int)a;
}

The first line is the prototype and the second is a "{". Every line ends with \n, and the last line is simply "}\n". There are lots of many-to-one gotos, and the labels are often outside the block they belong to (awful, I know :D). I only care about structural information, i.e. blocks and statement types. Here is what I'd like to get (when printed; indentation added for clarity):

[If(condition = [condition1], 
    bodytrue = ["func1( int hi );", 
                "unktype foo;" 
                DoWhile(condition = [condition3], 
                        body = [
                                SingleLineIf(condition = [condition2],
                                             bodytrue =["goto LABEL_1;"], 
                                             bodyelse = []
                                )
                                ]
                )
    ]
    bodyelse = ["float a = bar(baz, 0);",
               "int foobar = (int)a;"
    ]
)]

with condition1, condition2 and condition3 as strings. Other constructs would work the same way. The labels can be discarded. I also need to include blocks not associated with any special statement, like Block([...]). Standard Python parsers for C don't work (for instance, pycparser gives a syntax error) because of the unknown types.
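For concreteness, the target structure described above could be modelled roughly as follows. This is only a sketch: the class names follow the printed example, while the field types, the `Node` alias and the choice to make `SingleLineIf` a subclass of `If` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class If:
    condition: str
    bodytrue: List["Node"]
    bodyelse: List["Node"] = field(default_factory=list)


@dataclass
class SingleLineIf(If):
    """An if whose body is a single statement without braces (assumed subclass)."""


@dataclass
class DoWhile:
    condition: str
    body: List["Node"]


@dataclass
class Block:
    """A brace block that is not attached to any keyword."""
    body: List["Node"]


# A node is either a plain statement string or one of the structures above.
Node = Union[str, If, DoWhile, Block]

# The tree requested for the sample function body, built by hand.
expected = [
    If(
        "condition1",
        bodytrue=[
            "func1( int hi );",
            "unktype foo;",
            DoWhile(
                "condition3",
                body=[SingleLineIf("condition2", bodytrue=["goto LABEL_1;"])],
            ),
        ],
        bodyelse=["float a = bar(baz, 0);", "int foobar = (int)a;"],
    )
]
```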

  • 2
    You're going to have to guess, since it's not actually possible to unambiguously parse C under these constraints. It's usually possible to make pretty decent guesses, but you'll still have to guess. Commented Mar 4, 2019 at 22:08
  • 2
    For example, is (a)&b a bitwise operation or a pointer cast? Who knows! Commented Mar 4, 2019 at 22:10
  • 1
    Consider writing a lexical analyser for this. en.wikipedia.org/wiki/Lexical_analysis . Commented Mar 4, 2019 at 22:12
  • 1
    In C, whitespace characters can be ignored as long as they are not in a string and not separating tokens. Commented Mar 4, 2019 at 22:12
  • 1
    What are you actually asking here? I mean, since the code presented does not conform to the C language as it stands, it is not surprising that existing parsers reject it. It follows, then, that if you need to parse it then you need either to modify the code or prepare your own parser. I suspect you're after the latter, but in that case, the implied question is far too broad. Commented Mar 4, 2019 at 22:46
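The lexical-analysis suggestion in the comments can be sketched with nothing but the standard `re` module. The token categories, the `KEYWORDS` set and the `tokenize` name below are choices made for this illustration, and the sketch assumes the input contains no comments or preprocessor lines.

```python
import re

# One alternative per token kind; whitespace between tokens matches nothing
# and is therefore skipped by finditer().
TOKEN_RE = re.compile(r"""
    (?P<string>"(?:\\.|[^"\\])*")      # string literal, kept whole
  | (?P<char>'(?:\\.|[^'\\])*')        # character literal
  | (?P<ident>[A-Za-z_]\w*)            # identifier or keyword
  | (?P<number>\d[\w.]*)               # rough numeric literal
  | (?P<punct>[{}()\[\];:,])           # structural punctuation
  | (?P<op>[^\s\w{}()\[\];:,"']+)      # run of operator characters
""", re.VERBOSE)

KEYWORDS = {"if", "else", "do", "while", "for", "goto"}


def tokenize(source):
    """Yield (kind, text) pairs for each token found in source."""
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "ident" and text in KEYWORDS:
            kind = "keyword"
        yield kind, text


for token in tokenize("if ( condition2 )\n    goto LABEL_1;"):
    print(token)
```

A token stream like this deliberately does not decide whether `(a)&b` is a cast or a bitwise operation. For purely structural output only the keywords, braces, parentheses and semicolons matter, so that ambiguity can stay unresolved.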

1 Answer

Pyparsing includes a simple C parser as part of its examples. Here is a parser that will process your sample code, and a little bit more (it includes support for for statements).

This is not a very good C parser. It treats if, while, and do conditions very broadly, as just strings in nested parentheses. But it may give you a start on extracting the bits you are interested in.

import pyparsing as pp

IF, WHILE, DO, ELSE, FOR = map(pp.Keyword, "if while do else for".split())
SEMI, COLON, LBRACE, RBRACE = map(pp.Suppress, ';:{}')

stmt_body = pp.Forward()
single_stmt = pp.Forward()
stmt_block = stmt_body | single_stmt

if_condition = pp.ungroup(pp.nestedExpr('(', ')'))
while_condition = if_condition()
for_condition = if_condition()

if_stmt = pp.Group(IF 
           + if_condition("condition") 
           + stmt_block("bodyTrue")
           + pp.Optional(ELSE + stmt_block("bodyElse"))
           )
do_stmt = pp.Group(DO 
           + stmt_block("body") 
           + WHILE 
           + while_condition("condition")
           + SEMI
           )
while_stmt = pp.Group(WHILE + while_condition("condition")
              + stmt_block("body"))
for_stmt = pp.Group(FOR + for_condition("condition")
            + stmt_block("body"))
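# Any other statement: everything up to the next ';', as long as it does
# not start at a brace
other_stmt = (~(LBRACE | RBRACE) + pp.SkipTo(SEMI) + SEMI)
single_stmt <<= if_stmt | do_stmt | while_stmt | for_stmt | other_stmt
# A braced block is a nested list of statements
stmt_body <<= pp.nestedExpr('{', '}', content=single_stmt)

# Labels such as "LABEL_1:" are not needed, so the parser skips over them
label = pp.pyparsing_common.identifier + COLON

parser = pp.OneOrMore(stmt_block)
parser.ignore(label)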

sample = """
if ( condition1 )
{
    func1( int hi );
    unktype foo;
    do
    {
        if ( condition2 )
            goto LABEL_1;
    }
    while ( condition3 );
}
else
{
    float a = bar(baz, 0);
LABEL_1:
    int foobar = (int)a;
}
"""

print(parser.parseString(sample).dump())

prints:

[['if', 'condition1', ['func1( int hi )', 'unktype foo', ['do', [['if', 'condition2', 'goto LABEL_1']], 'while', 'condition3']], 'else', ['float a = bar(baz, 0)', 'int foobar = (int)a']]]
[0]:
  ['if', 'condition1', ['func1( int hi )', 'unktype foo', ['do', [['if', 'condition2', 'goto LABEL_1']], 'while', 'condition3']], 'else', ['float a = bar(baz, 0)', 'int foobar = (int)a']]
  - bodyElse: ['float a = bar(baz, 0)', 'int foobar = (int)a']
  - bodyTrue: ['func1( int hi )', 'unktype foo', ['do', [['if', 'condition2', 'goto LABEL_1']], 'while', 'condition3']]
    [0]:
      func1( int hi )
    [1]:
      unktype foo
    [2]:
      ['do', [['if', 'condition2', 'goto LABEL_1']], 'while', 'condition3']
      - body: [['if', 'condition2', 'goto LABEL_1']]
        [0]:
          ['if', 'condition2', 'goto LABEL_1']
          - bodyTrue: 'goto LABEL_1'
          - condition: 'condition2'
      - condition: 'condition3'
  - condition: 'condition1'
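
The nested list printed above can be reshaped into something closer to the structure requested in the question. The following is a rough sketch that works on the plain list form (copied here as a literal, so it runs without pyparsing). The `convert` and `convert_body` helpers and the tuple layout are made up for illustration, and it assumes every condition comes back as a single string, as in the sample.

```python
KEYWORDS = {"if", "do", "while", "for"}


def is_statement(node):
    """A list led by a keyword is one compound statement; other lists are blocks."""
    return (isinstance(node, list) and bool(node)
            and isinstance(node[0], str) and node[0] in KEYWORDS)


def convert_body(node):
    """Return a statement body as a list of converted statements."""
    if isinstance(node, str) or is_statement(node):
        return [convert(node)]                  # single statement, no braces
    return [convert(child) for child in node]   # braced block


def convert(node):
    if isinstance(node, str):
        return node + ";"                       # the parser suppressed the ';'
    if not is_statement(node):
        return ("Block", convert_body(node))    # block without a keyword
    kind = node[0]
    if kind == "if":
        # Layout: ['if', condition, body] or ['if', condition, body, 'else', body]
        body_else = convert_body(node[4]) if len(node) > 3 else []
        return ("If", node[1], convert_body(node[2]), body_else)
    if kind == "do":
        # Layout: ['do', body, 'while', condition]
        return ("DoWhile", node[3], convert_body(node[1]))
    # 'while' and 'for' share the layout [keyword, condition, body]
    return (kind.capitalize(), node[1], convert_body(node[2]))


# The first line of the dump above, as a plain Python list.
parsed = [['if', 'condition1',
           ['func1( int hi )', 'unktype foo',
            ['do', [['if', 'condition2', 'goto LABEL_1']], 'while', 'condition3']],
           'else',
           ['float a = bar(baz, 0)', 'int foobar = (int)a']]]

tree = [convert(node) for node in parsed]
print(tree)
```

With real parser output, calling `.asList()` on the parse results should give the same plain list shape. The sketch does not distinguish a single-line if from a braced one; that information is already lost in the flat list, so it would have to be recorded in the grammar itself.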

1 Comment

That looks promising, really cool! Pretty much what I'd like to get. Thank you so much Paul :)
