I'm trying to extract the following items from a C file:
- Comments (single and multi-line)
- String literals
- Decimal, octal and hexadecimal literals.
I've written the following regex to try and find those items:
/\*(?:.|[\r\n])*?\*/|"(?:[^"\\\r\n]|\\.)*"|//.*|\b\d+\b|\b0[xX][\da-fA-F]+\b
The expression is composed of five parts ORed together.
- /\*(?:.|[\r\n])*?\*/ checks for multi-line comments.
- "(?:[^"\\\r\n]|\\.)*" checks for string literals.
- //.* checks for single line comments.
- \b\d+\b checks for decimal and octal constants.
- \b0[xX][\da-fA-F]+\b checks for hexadecimal constants.
Although the expression seems to work fine when tested using regexpal and a 500-line C file, my Java program throws a StackOverflowError after a few matches.
Here is the Java code that uses the regex:
Pattern itemsOfInterestPattern = Pattern.compile(
    "/\\*(?:.|[\\r\\n])*?\\*/|\"(?:[^\"\\\\\\r\\n]|\\\\.)*\"|//.*|\\b\\d+\\b|\\b0[xX][\\da-fA-F]+\\b");
// Now, go through the source file and process any tags we find
Matcher itemsOfInterestMatcher = itemsOfInterestPattern.matcher(sourceFile);
int matchNumber = 0;
while (itemsOfInterestMatcher.find()) {
    // We've found a token
    ++matchNumber;
    String token = itemsOfInterestMatcher.group();
    // I then have a switch statement that processes each match depending on its type
}
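For the "depending on its type" step, one option is to wrap each alternative in a named group (Java 7 and later) and ask the matcher which group took part in the match. This is only a minimal sketch: the class name, the group names and the kindOf helper are hypothetical labels, and the five alternatives are copied unchanged, so it does nothing about the stack overflow.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenKinds {
    // Same five alternatives as in the question, each wrapped in a named group.
    // The group names are hypothetical labels chosen for this sketch.
    static final Pattern ITEMS = Pattern.compile(
          "(?<blockComment>/\\*(?:.|[\\r\\n])*?\\*/)"
        + "|(?<string>\"(?:[^\"\\\\\\r\\n]|\\\\.)*\")"
        + "|(?<lineComment>//.*)"
        + "|(?<decimalOrOctal>\\b\\d+\\b)"
        + "|(?<hex>\\b0[xX][\\da-fA-F]+\\b)");

    static final String[] KINDS = {
        "blockComment", "string", "lineComment", "decimalOrOctal", "hex"
    };

    // Returns the name of the group that produced the current match.
    static String kindOf(Matcher m) {
        for (String kind : KINDS) {
            if (m.group(kind) != null) {
                return kind;
            }
        }
        throw new IllegalStateException("no alternative matched");
    }

    public static void main(String[] args) {
        String source = "int x = 0x1F; /* note */ char *s = \"hi\"; // done 42";
        Matcher m = ITEMS.matcher(source);
        while (m.find()) {
            // Print the kind of each item followed by its text.
            System.out.println(kindOf(m) + ": " + m.group());
        }
    }
}
```

The string returned by kindOf can then drive the switch statement directly, instead of re-inspecting the token text.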
The stack trace when the overflow occurs can be found at http://pastebin.com/7eL6mVd2
What's causing the stack overflow and how can I change the expression to allow it to work?
Amr
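A possible explanation, offered as a sketch rather than a confirmed diagnosis: as far as I can tell from the OpenJDK implementation, java.util.regex matches a repeated group that contains an alternation, such as (?:.|[\r\n])*?, by recursing once per repetition, so a long comment can exhaust the thread's stack. Replacing that group with a single character class, [\s\S]*?, should let the engine iterate instead. The class and method names below are hypothetical. The string-literal alternative uses the same repeated-group construct, so very long strings may need similar treatment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SafeCommentPattern {
    // Only the block-comment alternative differs from the question's pattern:
    // (?:.|[\r\n])*? is replaced by the single character class [\s\S]*?.
    static final Pattern ITEMS = Pattern.compile(
          "/\\*[\\s\\S]*?\\*/"
        + "|\"(?:[^\"\\\\\\r\\n]|\\\\.)*\""
        + "|//.*"
        + "|\\b\\d+\\b"
        + "|\\b0[xX][\\da-fA-F]+\\b");

    // Builds a block comment with the given number of body lines.
    static String longComment(int lines) {
        StringBuilder sb = new StringBuilder("/*\n");
        for (int i = 0; i < lines; i++) {
            sb.append(" * filler line\n");
        }
        return sb.append(" */").toString();
    }

    // Returns the first item found in the source, or null if there is none.
    static String firstItem(String source) {
        Matcher m = ITEMS.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // A comment of several hundred thousand characters.
        String comment = longComment(20000);
        String found = firstItem("int x; " + comment + " int y;");
        System.out.println(comment.equals(found));
    }
}
```

Compiling with Pattern.DOTALL and writing /\*.*?\*/ would be an equivalent way to express the same idea, at the cost of also changing how //.* behaves, which is why the sketch uses an explicit character class.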
0.5 but \b\d+\b will not match any part of floating values in scientific notation: 1e1, or integer literals with a size specifier: 1L.
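The boundary behaviour in this comment can be checked directly. \b only matches between a word character and a non-word character, and digits, letters and the underscore are all word characters, so the 1 in 1e1 or 1L has no boundary after it. The sketch below runs the two numeric alternatives from the question against a few literals. The class name, helper name and sample inputs are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberLiteralCheck {
    // The two numeric alternatives from the question, unchanged.
    static final Pattern NUMBERS =
        Pattern.compile("\\b\\d+\\b|\\b0[xX][\\da-fA-F]+\\b");

    // Collects every substring of the input that the pattern matches.
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBERS.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Print what the pattern extracts from each sample literal.
        for (String literal : new String[] {"0.5", "1e1", "1L", "017", "0x1F"}) {
            System.out.println(literal + " -> " + matches(literal));
        }
    }
}
```

If suffixed or floating-point literals matter, the numeric alternatives would need to be extended to cover them; this sketch only shows what the current ones do.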