9

I have this code in my C file:

printf("Worker name is %s and id is %d", worker.name, worker.id);

I want, with Python, to be able to parse the format string and locate the "%s" and "%d".

So I want to have a function:

>>> my_function("Worker name is %s and id is %d")
[Out1]: ((15, "%s"), (28, "%d"))

I've tried to achieve this using libclang's Python bindings and with pycparser, but I didn't see how this can be done with these tools.

I've also tried using a regex to solve this, but it is not simple at all: think about cases where the printf format contains "%%s" and the like.
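A minimal illustration of the pitfall (the pattern and the test string here are made up for the example): a naive regex reports a "%s" that printf would actually print as literal text, because the preceding "%%" is an escaped percent sign.

```python
import re

# Naive approach: look for "%s" or "%d" anywhere in the string.
naive = re.compile(r"%[sd]")

fmt = "100%%s done, id is %d"
naive_hits = [(m.start(), m.group()) for m in naive.finditer(fmt)]
print(naive_hits)
# [(4, '%s'), (19, '%d')]
# printf would output "100%s done, id is <number>", so only the "%d"
# at index 19 is a real conversion; the hit at index 4 is a false positive.
```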

Both gcc and clang obviously do this as part of compiling. Has no one exported this logic to Python?

7
  • All I want to do is simply locate the "%d" and "%s" inside the string (to know their indexes, if you will), not convert this to a Python print. Commented May 3, 2015 at 7:42
  • You cannot easily parse it with a simple regex; you need to handle it char by char. Commented May 3, 2015 at 7:45
  • This is of course possible, but not simple, so I'd rather avoid it. It's weird that this logic, which is inside gcc and clang, is not available in Python, not even in C parsing libraries. Commented May 3, 2015 at 7:48
  • 1
    "Both gcc and clang obviously do this as part of compiling" No. This is done at runtime. gcc simply sees a string. Commented May 3, 2015 at 8:37
  • 1
    Actually, both of them do also do this at compile time when generating warnings for -Wformat. A C compiler is not required to do this, which does not mean that no C compiler does it :D Commented May 3, 2015 at 8:53

3 Answers

9

You can certainly find properly formatted candidates with a regex.

Take a look at the definition of the C format specification. (I'm using Microsoft's, but use whichever you want.)

It is:

%[flags] [width] [.precision] [{h | l | ll | w | I | I32 | I64}] type

You also have the special case of %% which becomes % in printf.

You can translate that pattern into a regex:

(                                 # start of capture group 1
%                                 # literal "%"
(?:                               # first option
(?:[-+0 #]{0,5})                  # optional flags
(?:\d+|\*)?                       # width
(?:\.(?:\d+|\*))?                 # precision
(?:h|l|ll|w|I|I32|I64)?           # size
[cCdiouxXeEfgGaAnpsSZ]            # type
) |                               # OR
%%)                               # literal "%%"

And then into a Python regex:

import re

lines='''\
Worker name is %s and id is %d
That is %i%%
%c
Decimal: %d  Justified: %.6d
%10c%5hc%5C%5lc
The temp is %.*f
%ss%lii
%*.*s | %.3d | %lC | %s%%%02d'''

# Raw string so that \d, \. and \* reach the regex engine unchanged
cfmt = r'''
(                                  # start of capture group 1
%                                  # literal "%"
(?:                                # first option
(?:[-+0 #]{0,5})                   # optional flags
(?:\d+|\*)?                        # width
(?:\.(?:\d+|\*))?                  # precision
(?:h|l|ll|w|I|I32|I64)?            # size
[cCdiouxXeEfgGaAnpsSZ]             # type
) |                                # OR
%%)                                # literal "%%"
'''

for line in lines.splitlines():
    # Collect (index, specifier) pairs for every match in this line
    print('"{}"\n\t{}\n'.format(line,
          tuple((m.start(1), m.group(1)) for m in re.finditer(cfmt, line, flags=re.X))))

Prints:

"Worker name is %s and id is %d"
    ((15, '%s'), (28, '%d'))

"That is %i%%"
    ((8, '%i'), (10, '%%'))

"%c"
    ((0, '%c'),)

"Decimal: %d  Justified: %.6d"
    ((9, '%d'), (24, '%.6d'))

"%10c%5hc%5C%5lc"
    ((0, '%10c'), (4, '%5hc'), (8, '%5C'), (11, '%5lc'))

"The temp is %.*f"
    ((12, '%.*f'),)

"%ss%lii"
    ((0, '%s'), (3, '%li'))

"%*.*s | %.3d | %lC | %s%%%02d"
    ((0, '%*.*s'), (8, '%.3d'), (15, '%lC'), (21, '%s'), (23, '%%'), (25, '%02d'))
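
To get the exact return shape asked for in the question, the same pattern can be wrapped in a function that drops the "%%" matches, since those consume no argument. This is only a sketch: the name my_function comes from the question, while CFMT and the decision to skip "%%" are assumptions of this example.

```python
import re

# Same verbose pattern as above, compiled once.
# CFMT is a name chosen for this sketch.
CFMT = re.compile(r'''
(                                  # start of capture group 1
%                                  # literal "%"
(?:                                # first option
(?:[-+0 #]{0,5})                   # optional flags
(?:\d+|\*)?                        # width
(?:\.(?:\d+|\*))?                  # precision
(?:h|l|ll|w|I|I32|I64)?            # size
[cCdiouxXeEfgGaAnpsSZ]             # type
) |                                # OR
%%)                                # literal "%%"
''', re.X)


def my_function(fmt):
    """Return ((index, specifier), ...) for each conversion, skipping '%%'."""
    return tuple((m.start(1), m.group(1))
                 for m in CFMT.finditer(fmt)
                 if m.group(1) != '%%')


print(my_function("Worker name is %s and id is %d"))
# ((15, '%s'), (28, '%d'))
print(my_function("100%%s done, id is %d"))
# ((19, '%d'),)
```

Matching "%%" first and filtering it out afterwards is what keeps the "s" in "%%s" from being reported as a specifier.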

1

A simple implementation might be the following generator:

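def find_format_specifiers(s):
    # True when the previous character was a "%" that has already been handled
    last_percent = False
    for i in range(len(s)):
        if s[i] == "%" and not last_percent:
            # Skip the "%%" escape, and guard against a trailing "%"
            if i + 1 < len(s) and s[i+1] != "%":
                yield (i, s[i:i+2])
            last_percent = True
        else:
            last_percent = False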

>>> list(find_format_specifiers("Worker name is %s and id is %d but %%q"))
[(15, '%s'), (28, '%d')]

This can be fairly easily extended to handle additional format specifier information like width and precision, if needed.
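
As a rough sketch of such an extension, the generator below walks the string character by character and accepts optional flags, width, precision, and a length modifier before the conversion type. The character sets follow the standard C printf syntax (not the Microsoft-specific extensions), and the names find_full_specifiers, FLAGS, LENGTHS, TYPES, and _skip_number_or_star are made up for this example. Malformed specifiers are silently skipped, which is a simplification.

```python
FLAGS = "-+ #0"
LENGTHS = ("hh", "ll", "h", "l", "j", "z", "t", "L")  # two-char modifiers first
TYPES = "diouxXeEfFgGaAcspn"
DIGITS = "0123456789"


def _skip_number_or_star(s, i):
    """Return the index just past a '*' or a run of digits starting at i."""
    if i < len(s) and s[i] == "*":
        return i + 1
    while i < len(s) and s[i] in DIGITS:
        i += 1
    return i


def find_full_specifiers(s):
    """Yield (index, specifier) for each printf-style conversion in s."""
    i, n = 0, len(s)
    while i < n:
        if s[i] != "%":
            i += 1
            continue
        start = i
        i += 1
        if i < n and s[i] == "%":        # "%%" is a literal percent sign
            i += 1
            continue
        while i < n and s[i] in FLAGS:   # flags
            i += 1
        i = _skip_number_or_star(s, i)   # width
        if i < n and s[i] == ".":        # precision
            i = _skip_number_or_star(s, i + 1)
        for length in LENGTHS:           # length modifier
            if s.startswith(length, i):
                i += len(length)
                break
        if i < n and s[i] in TYPES:      # conversion type
            i += 1
            yield (start, s[start:i])
        # Otherwise the specifier is malformed and is skipped.


print(list(find_full_specifiers("%*.*s | %.3d | %lld | %5hc")))
# [(0, '%*.*s'), (8, '%.3d'), (15, '%lld'), (22, '%5hc')]
```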

5 Comments

Strangely enough "%-0.3%" is a valid format specifier (meaning "%" and not using any argument)
Yeah, as mentioned my answer doesn't handle any extra embellishment between the leading % and the type specifier because the OP didn't ask for that.
Sorry for the noise... I realized now that the OP is asking about C format strings, not Python old-style format strings.
@6502 Your concerns are valid, even in C. However, writing a top-notch printf parser would probably be too much for a Sunday morning, at least for me :)
... I now realized that, and therefore deleted my answer. However, for a stable solution we'll need a perfect format string parser that handles all the edge cases, since those edge cases exist in almost every software project.
0

This is some code I have written that prints the indexes of %s, %d, or any such format specifier:

import re

def myfunc(text):
    # Grab the first parenthesised group, e.g. the argument list of printf(...)
    match = re.search(r'\(.*?\)', text)
    if match:
        # Remove the parentheses and double quotes
        new_str = match.group().translate(str.maketrans('', '', '()"'))
        print(new_str)
        parse(new_str)
    else:
        print("No match")

def parse(text):
    # Report every "%" together with the character that follows it
    g = text.find('%')
    while g != -1:
        print(" %", text[g+1:g+2], " = ", g)
        g = text.find('%', g + 1)

text = input()
myfunc(text)

Hope it helps.

1 Comment

Thank you! It's a great start for me, even though it doesn't cover all the cases - such as %*d and stuff like that
