Stackoverflow in pattern matching in java

Question

I tried to split a line based on spaces not enclosed between double quotes.

My regex is

(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+

My Code

Pattern regex          = Pattern.compile("(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+");
Matcher regexMatcher   = regex.matcher(line);
List<String> rule      = new ArrayList<String>();

while(regexMatcher.find())
    rule.add(regexMatcher.group());

Input for which it is failed.

SecRule REQUEST_COOKIES|!REQUEST_COOKIES:/__utm/|REQUEST_COOKIES_NAMES|ARGS_NAMES|ARGS|XML:/* "(?i:\b(?:(?:s(?:t(?:d(?:dev(_pop|_samp)?)?|r(?:_to_date|cmp))|u(?:b(?:str(?:ing(_index)?)?|(?:dat|tim)e)|m)|e(?:c(?:_to_time|ond)|ssion_user)|ys(?:tem_user|date)|ha(1|2)?|oundex|chema|ig?n|pace|qrt)|i(?:s(null|_(free_lock|ipv4_compat|ipv4_mapped|ipv4|ipv6|not_null|not|null|used_lock))?|n(?:et6?_(aton|ntoa)|s(?:ert|tr)|terval)?|f(null)?)|u(?:n(?:compress(?:ed_length)?|ix_timestamp|hex)|tc_(date|time|timestamp)|p(?:datexml|per)|uid(_short)?|case|ser)|l(?:o(?:ca(?:l(timestamp)?|te)|g(2|10)?|ad_file|wer)|ast(_day|_insert_id)?|e(?:(?:as|f)t|ngth)|case|trim|pad|n)|t(?:ime(stamp|stampadd|stampdiff|diff|_format|_to_sec)?|o_(base64|days|seconds|n?char)|r(?:uncate|im)|an)|m(?:a(?:ke(?:_set|date)|ster_pos_wait|x)|i(?:(?:crosecon)?d|n(?:ute)?)|o(?:nth(name)?|d)|d5)|r(?:e(?:p(?:lace|eat)|lease_lock|verse)|o(?:w_count|und)|a(?:dians|nd)|ight|trim|pad)|f(?:i(?:eld(_in_set)?|nd_in_set)|rom_(base64|days|unixtime)|o(?:und_rows|rmat)|loor)|a(?:es_(?:de|en)crypt|s(?:cii(str)?|in)|dd(?:dat|tim)e|(?:co|b)s|tan2?|vg)|p(?:o(?:sition|w(er)?)|eriod_(add|diff)|rocedure_analyse|assword|i)|b(?:i(?:t_(?:length|count|x?or|and)|n(_to_num)?)|enchmark)|e(?:x(?:p(?:ort_set)?|tract(value)?)|nc(?:rypt|ode)|lt)|v(?:a(?:r(?:_(?:sam|po)p|iance)|lues)|ersion)|g(?:r(?:oup_conca|eates)t|et_(format|lock))|o(?:(?:ld_passwo)?rd|ct(et_length)?)|we(?:ek(day|ofyear)?|ight_string)|n(?:o(?:t_in|w)|ame_const|ullif)|(rawton?)?hex(toraw)?|qu(?:arter|ote)|(pg_)?sleep|year(week)?|d?count|xmltype|hour)\W*\(|\b(?:(?:s(?:elect\b(?:.{1,100}?\b(?:(?:length|count|top)\b.{1,100}?\bfrom|from\b.{1,100}?\bwhere)|.*?\b(?:d(?:ump\b.*\bfrom|ata_type)|(?:to_(?:numbe|cha)|inst)r))|p_(?:sqlexec|sp_replwritetovarbin|sp_help|addextendedproc|is_srvrolemember|prepare|sp_password|execute(?:sql)?|makewebtask|oacreate)|ql_(?:longvarchar|variant))|xp_(?:reg(?:re(?:movemultistring|ad)|delete(?:value|key)|enum(?:value|key)s|addmultistring|write)|terminate|xp_servicecontrol|xp_ntsec_enumdomains|xp_terminate_process|e(?:xecresultset|numdsn)|availablemedia|loginconfig|cmdshell|filelist|dirtree|makecab|ntsec)|u(?:nion\b.{1,100}?\bselect|tl_(?:file|http))|d(?:b(?:a_users|ms_java)|elete\b\W*?\bfrom)|group\b.*\bby\b.{1,100}?\bhaving|open(?:rowset|owa_util|query)|load\b\W*?\bdata\b.*\binfile|(?:n?varcha|tbcreato)r|autonomous_transaction)\b|i(?:n(?:to\b\W*?\b(?:dump|out)file|sert\b\W*?\binto|ner\b\W*?\bjoin)\b|(?:f(?:\b\W*?\(\W*?\bbenchmark|null\b)|snull\b)\W*?\()|print\b\W*?\@\@|cast\b\W*?\()|c(?:(?:ur(?:rent_(?:time(?:stamp)?|date|user)|(?:dat|tim)e)|h(?:ar(?:(?:acter)?_length|set)?|r)|iel(?:ing)?|ast|r32)\W*\(|o(?:(?:n(?:v(?:ert(?:_tz)?)?|cat(?:_ws)?|nection_id)|(?:mpres)?s|ercibility|alesce|t)\W*\(|llation\W*\(a))|d(?:(?:a(?:t(?:e(?:(_(add|format|sub))?|diff)|abase)|y(name|ofmonth|ofweek|ofyear)?)|e(?:(?:s_(de|en)cryp|faul)t|grees|code)|ump)\W*\(|bms_\w+\.\b)|(?:;\W*?\b(?:shutdown|drop)|\@\@version)\b|\butl_inaddr\b|\bsys_context\b|'(?:s(?:qloledb|a)|msdasql|dbo)'))" "phase:2,rev:'2',ver:'OWASP_CRS/2.2.9',maturity:'9',accuracy:'8',capture,t:none,t:urlDecodeUni,ctl:auditLogParts=+E,block,msg:'SQL Injection Attack',id:'950001',tag:'OWASP_CRS/WEB_ATTACK/SQL_INJECTION',tag:'WASCTC/WASC-19',tag:'OWASP_TOP_10/A1',tag:'OWASP_AppSensor/CIE1',tag:'PCI/6.5.2',logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',severity:'2',setvar:'tx.msg=%{rule.msg}',setvar:tx.sql_injection_score=+%{tx.critical_anomaly_score},setvar:tx.anomaly_score=+%{tx.critical_anomaly_score},setvar:tx.%{rule.id}-OWASP_CRS/WEB_ATTACK/SQL_INJECTION-%{matched_var_name}=%{tx.0}

When i used this in java, some lines are separated successfully, but some lines are causing errors

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4235)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)

Sample Input:

The "world \"is beautiful" but i "cannot see" it

Expected Output:

The
"world \" beautiful"
but
i
"cannot see"
it

Please post a complete (but short) program, along with the input causing the error. — Bernhard Barker
– Bernhard Barker, Commented Feb 11, 2014 at 12:21
@Dukeling: It is most likely caused by a long input. Java uses recursion in matching repetition. — nhahtdh
– nhahtdh, Commented Feb 11, 2014 at 12:22
@nhahtdh I have updated the question. How to overcome the error. Whether there is any problem in my regex — Krishna M
– Krishna M, Commented Feb 11, 2014 at 12:25
@KrishnaM: There are actually 2 things to fix: 1) your current regex is not working correctly 2) the stack overflow error. Need some time to test. — nhahtdh
– nhahtdh, Commented Feb 11, 2014 at 12:28
How about splitting on all spaces and putting pieces back together where piece 1 ends with " and piece 2 starts with " in a second step? That should be feasable without nasty regexs. — Fildor
– Fildor, Commented Feb 11, 2014 at 12:29

nhahtdh · Accepted Answer · 2014-03-07 04:44:57Z

Why does StackOverflowError occur?

In reference implementation of Pattern class (which comes with Oracle's JRE, OpenJDK, and a number of other JVMs), greedy and lazy quantifiers are implemented with recursion¹ when the repeated pattern is non-trivial. Therefore, you will run into StackOverflowError when the input string is long enough.

^{¹ Recursion is a quick but not scalable solution to allow backtracking in the pattern. Better implementation uses a data structure to store backtracking points
(which basically converts recursive solution to iterative solution with stack).}

Solution

The following regex should work:

"(?:\"(?:[^\"\\\\]++|\\\\.)*+\"|[^ \"]++)++"

Well, the regex is quite confusing with 2 layers of escaping: escaping in Java string literal and escaping in regex syntax.

The raw regex when you print the string out. My explanation will be based on the raw regex.

(?:"(?:[^"\\]++|\\.)*+"|[^ "]++)++

Explanation

Since you only care about what the whole regex matches, all the capturing groups (pattern) has been turned into non-capturing group (?:pattern) for efficiency.

The first alternative "(?:[^"\\]++|\\.)*+" matches a quoted string.

The second alternative [^ "]++ matches a sequence of character that doesn't contain space and double quote ".

(?:
   "             # Double quote
   (?:
      [^"\\]++   # A sequence of characters that are not " and \
      |          # OR
      \\.        # Escape sequence: \ followed by any character (except line terminators)
   )*+           # Match 0 or more of the sequences above (allows empty string)
   "             # Double quote
   |
   [^ "]++
)++

Since the regex is written so that there is no needs for backtracking, all quantifiers are made possessive. Since Pattern class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError will not occur.

I remove the need for backtracking by writing the regex so that it matches the correct string on first try:

Since [^"\\] excludes the \, there is no way we can "steal" a \ from an escaping sequence, or "steal" a " and mess up the closing quote, we can safely advance ahead without backtracking. This explains the possessive quantifier here [^"\\]++. There is no need to assign a quantifier here, but I do this to reduce the work on the branching.
Since both [^"\\]++ and \\. can't "steal" a " and mess up the closing quote, we can advance ahead without backtracking. This explains the possessive quantifier here (?:[^"\\]++|\\.)*+
[^ "] can't start a quoted string, and it also can't match a space (delimiter). This is why we can use possessive quantifier on it.
Since "(?:[^"\\]++|\\.)*+" and [^ "]++ can't mess up the match of each other, we can make the outer most quantifier possessive.

Example of a regex that doesn't match things correctly on first try and only get the correct result after backtracking would be ^([bcd]+:[ab]+)+$ against inputs such as b:ab:a. The first iteration will match b:ab, which cause the 2nd iteration to fail, then it backtracks and retry with the first iteration being b:a and then successfully match the whole string.

Ingo · Accepted Answer · 2014-02-11 14:35:10Z

2

Your regex is broken:

(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+
  #### ####### ###
  |    |       ---------------- A dot
  |    ------------------------ Any character not "
  ----------------------------- A " (no need to put it in a character class)

At this point I stopped looking further, because I am sure this is not what you want.

By the way, I recommend writing the regex first and only then do the quoting (you could write yourself a tool that does this, it is purely mechanical: add one \ before every " and every \, then enclose in ""). Also, don't use character classes for single characters.

In fact, it appears what you're looking for are words, or strings. So, why don't you say just that.

You can use a top down approach:

REGEX = (WORD|STRING)
WORD  = \w+                  -- or \p{L} or something like that
STRING = "(SOMETHING)*"
SOMETHING = \\["\\]|[^\\"]    -- an escaped quote, an escaped backslash or 
                              -- something that is neither a backslash nor a quote

Now:

Replace the uppercase words by their right hand sides
Remove unnecessary (), if any
Quote

You can test important sub-regexes separately, for example the STRING. Turned out I had several errors in my first version, and this even when writing unquoted! To write/discuss such a regex in the form java demands from the start is virtually impossible.

edited Feb 11, 2014 at 14:35

answered Feb 11, 2014 at 12:39

Ingo

36.4k6 gold badges57 silver badges102 bronze badges

3 Comments

nhahtdh Over a year ago

[^\\\"] is not ", since the RAW regex is [^\"].

Ingo Over a year ago

Yes, @nhahtdh, makes even less sense, he probably meant [^\"] (after quoting)

nhahtdh Over a year ago

Note that your grammar is broken: it allows "ab\"\" to be matched as STRING, since the last \ matches [^"]

Community · Accepted Answer · 2017-05-23 12:26:09Z

0

Thanks for all your responses. Atlast i found my mistake. The actual reason is for stackoverflow is not my regex. My regex was correct. I used eclipse for coding. The actual reason for stackoverflow is my stack size. Intially my stacksize was 1Mb. I increased my stack size for the program in my eclipse and there was no error.

Java stack overflow error - how to increase the stack size in Eclipse?

UPDATE:

There is no need to change the stack size. As mentioned by nhahtdh, I have changed the regex to regex with possessive quantifier and there was no stackoverflow error.

My Regex is now ("([^\\"]|\\.)++"|[^\s]++)

To learn more about Possessive quantifier follow this link.

edited May 23, 2017 at 12:26

CommunityBot

11 silver badge

answered Feb 13, 2014 at 10:15

Krishna M

1,2152 gold badges18 silver badges34 bronze badges

9 Comments

nhahtdh Over a year ago

Increasing stack size is at most a patchy solution. When you have a longer string, you will hit the limit of your stack. The better solution is always to reduce the stack usage (as Jon Skeet suggested in the cited post), in this case, via the use of possessive quantifier instead of plain greedy quantifier.

Krishna M Over a year ago

I can get your point. But the reason my regex failed was due to the input string length. There were more than 2000 characters enclosed between a double quote. In that case I found no other solution other than increasing the stack size. Is there any other way for matching 2000 characters enclosed in the double quote without increasing stack size ?

nhahtdh Over a year ago

Instead of using greedy quantifier, rewrite the regex to use possessive quantifier (it is not a direct switch in general cases, though). I explained in my answer

Since Pattern class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError will not occur.

nhahtdh Over a year ago

And another thing. Your regex is incorrect. Ingo's post has pointed out why it is incorrect. Your regex does not parse escaped double quote \" correctly.

Krishna M Over a year ago

@nhahtdh : I have changed the regex now and there is no stackoverflow error. Thanks for pointing my mistake. And there is no \" in my expression. I later learned that \ is used to denote that the rule continues in next line also. I have updated that in my problem statement.

|

Community · Accepted Answer · 2017-05-23 12:26:09Z

-2

The first thing to try is to increase the stack size.

If that does not work, you might have hit a bug. You could try a different JVM and set up the JVM to use something other than OpenJDK for the class library, and tinker with the regex to see exactly what is triggering it.

edited May 23, 2017 at 12:26

CommunityBot

11 silver badge

answered Feb 11, 2014 at 12:30

Miserable Variable

28.8k15 gold badges79 silver badges138 bronze badges

8 Comments

nhahtdh Over a year ago

Different JVM won't fix the problem, as long as they are using the same code base for Java Class Library. Increasing the stack size should be the last resort.

Miserable Variable Over a year ago

@nhahtdh can you explain why you are so sure that a bug in JVM cannot cause such problems?

nhahtdh Over a year ago

You can say that it is a (long-running and a design) bug in Pattern class (which if part of Java Class Library) and I really doubt it is going to be fixed any time soon. Since Pattern uses recursion to match greedy and lazy quantifiers, it will run into StackOverflowError for large enough input. However, possessive quantifiers implements matching with a loop (since it does not allow backtracking), there will be no problem. As long as all the JVM uses the same implementation of Pattern class, you won't run away from the problem.

Miserable Variable Over a year ago

Hmmm, you are referring to specific case of Pattern and related classes while I was referring to the general problem of StackOverflowException in standard classes. I don't think that merits a downvote, but thanks for you explanation.

Alan Moore Over a year ago

This is bad advice anyway. The first thing one should try after getting a StackOverflowException is write less inefficient code. It's not the regex engine that's blowing the stack here, it's the regex author.

|

Collectives™ on Stack Overflow

Stackoverflow in pattern matching in java

4 Answers 4

Why does StackOverflowError occur?

Solution

Explanation

Comments

3 Comments

9 Comments

8 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Why does StackOverflowError occur?

Solution

Explanation

Comments

3 Comments

9 Comments

8 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related