1

I'm trying to extract the following items from a C file:

  • Comments (single and multi-line)
  • String literals
  • Decimal, octal and hexadecimal literals.

I've written the following regex to try and find those items:

/\*(?:.|[\r\n])*?\*/|"(?:[^"\\\r\n]|\\.)*"|//.*|\b\d+\b|\b0[xX][\da-fA-F]+\b

The expression is composed of five parts ORed together.

  • /\*(?:.|[\r\n])*?\*/ checks for multi-line comments.
  • "(?:[^"\\\r\n]|\\.)*" checks for string literals.
  • //.* checks for single line comments.
  • \b\d+\b checks for decimal and octal constants.
  • \b0[xX][\da-fA-F]+\b checks for hexadecimal constants.

Although the expression seems to work fine when tested using regexpal and a 500 line C file, my Java program throws a StackOverflowException after a few matches.

Here is the Java code that uses the regex:

Pattern itemsOfInterestPattern = Pattern.compile(
        "/\\*(?:.|[\\r\\n])*?\\*/|\"(?:[^\"\\\\\\r\\n]|\\\\.)*\"|//.*|\\b\\d+\\b|\\b0[xX][\\da-fA-F]+\\b");
// Now, go through the source file and process any tags we find
Matcher itemsOfInterestMatcher = itemsOfInterestPattern.matcher(sourceFile);
int matchNumber = 0;
while (itemsOfInterestMatcher.find()) {
    // We've found a token
    ++matchNumber;
    String token = itemsOfInterestMatcher.group();
    // I then have a switch statement that processes each match depending on its type
}

The stack trace when the overflow occurs can be found at http://pastebin.com/7eL6mVd2

What's causing the stack overflow and how can I change the expression to allow it to work?

Amr

5
  • 3
    It might have something to do with the actual java code. Mind posting it? Commented Mar 25, 2012 at 13:06
  • the non-capturing (?:) might be problematic since its processing consumes much stack. Commented Mar 25, 2012 at 13:21
  • 1
    Your number literal patterns will match the integer portion and fraction portions of 0.5 but \b\d+\b will not match any part of floating values in scientific notation 1e1, or integer literals with a size specifier: 1L. Commented Mar 25, 2012 at 13:22
  • @MikeSamuel: Thanks for pointing that out - once I get the expression running in Java, I'll look into modifying it to match the conditions you highlighted. Commented Mar 25, 2012 at 13:24
  • @MikeSamuel: It also doesn't capture negative numbers...bugger, got to work on that part... Commented Mar 25, 2012 at 13:31

2 Answers 2

2

Judging from the number of times that java.util.regex.Pattern$LazyLoop.match(...) appears in the stack-trace, I'm betting the problem is the use of the reluctant quantifier *?: first it tries to match nothing, then it backtracks and tries to match one character, then it backtracks and tries to match two characters, and so on. So if you have a long comment, it will have to do a lot of backtracking, which apparently involves recursion. (I don't know if all backtracking involves recursion, or just reluctant-quantifier backtracking; in fact, until now, I didn't even realize that reluctant-quantifier backtracking did.) If you change this part:

/\*(?:.|[\r\n])*?\*/

to this:

/\*(?:[^*]|\*(?!/))*+\*/

(using the possessive quantifier *+ instead — it tries to match as much as it can, and never gives anything back), I think you'll find you can handle long comments much better. So, overall, your string-literal would look like this:

"/\\*(?:[^*]|\\*(?!/))*+\\*/|\"(?:[^\"\\\\\\r\\n]|\\\\.)*\"|//.*|\\b\\d+\\b|\\b0[xX][\\da-fA-F]+\\b"

Edited to add (July '13): Someone at my company had a similar issue recently, which led me to look a bit deeper into the cause. What I found is, the problem isn't the backtracking alone, but the combination of the backtracking with the subgroup; for example, a* or a*? would not have this problem, but (a)* or (a)*? or (?:a)* or (?:a)*? would. Above, I suggested disabling backtracking, by using *+ instead of *? (and making the necessary changes to the subexpression); but another approach would have been to eliminate the subexpression, by changing this:

/\*(?:.|[\r\n])*?\*/

to this:

/\*(?s:.*?)\*/

(where the (?s:...) notation is equivalent to ..., except that it locally turns on MULTILINE mode, meaning that . will match any character, even \n). The .*? doesn't require recursion in order to enable backtracking.

That said, I think the *+ approach is better in this case, and perhaps in most cases, since its algorithmic time complexity is lower. (.*? requires constantly trying to match and re-match the rest of the pattern; it can perform arbitrary backtracking without overflowing the stack, but it can take an inordinate amount of time to do so.)

Sign up to request clarification or add additional context in comments.

1 Comment

You are correct in your deductions. I did some more testing and determined that large multi line comments were what was causing the stack overflow. Your solution allows me to match large comments without any problem. I need to go away now and learn more about how the regex engine behaves so I can avoid these problems in the future..
0

Given the error (we don't see the code:() Important tip is that the actual answer is usually hidden somewhere in the first ten lines of the stack trace. You need to read it several times and then check the code. It seems that most of the errors are to do with regex except first two:

at java.lang.AbstractStringBuilder.charAt(AbstractStringBuilder.java:173)
at java.lang.StringBuilder.charAt(StringBuilder.java:55)

IMHO you should check these two lines (maybe with a debugger). Other posts say that such errors may be thrown when you run out of memory - How to validate big xml against xsd schema?. Try smaller file with less comments to start with and check if this error still occurs.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.