0

I tried to split a line based on spaces not enclosed between double quotes.

My regex is

(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+

My Code

Pattern regex          = Pattern.compile("(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+");
Matcher regexMatcher   = regex.matcher(line);
List<String> rule      = new ArrayList<String>();

while(regexMatcher.find())
    rule.add(regexMatcher.group());

Input for which it is failed.

SecRule REQUEST_COOKIES|!REQUEST_COOKIES:/__utm/|REQUEST_COOKIES_NAMES|ARGS_NAMES|ARGS|XML:/* "(?i:\b(?:(?:s(?:t(?:d(?:dev(_pop|_samp)?)?|r(?:_to_date|cmp))|u(?:b(?:str(?:ing(_index)?)?|(?:dat|tim)e)|m)|e(?:c(?:_to_time|ond)|ssion_user)|ys(?:tem_user|date)|ha(1|2)?|oundex|chema|ig?n|pace|qrt)|i(?:s(null|_(free_lock|ipv4_compat|ipv4_mapped|ipv4|ipv6|not_null|not|null|used_lock))?|n(?:et6?_(aton|ntoa)|s(?:ert|tr)|terval)?|f(null)?)|u(?:n(?:compress(?:ed_length)?|ix_timestamp|hex)|tc_(date|time|timestamp)|p(?:datexml|per)|uid(_short)?|case|ser)|l(?:o(?:ca(?:l(timestamp)?|te)|g(2|10)?|ad_file|wer)|ast(_day|_insert_id)?|e(?:(?:as|f)t|ngth)|case|trim|pad|n)|t(?:ime(stamp|stampadd|stampdiff|diff|_format|_to_sec)?|o_(base64|days|seconds|n?char)|r(?:uncate|im)|an)|m(?:a(?:ke(?:_set|date)|ster_pos_wait|x)|i(?:(?:crosecon)?d|n(?:ute)?)|o(?:nth(name)?|d)|d5)|r(?:e(?:p(?:lace|eat)|lease_lock|verse)|o(?:w_count|und)|a(?:dians|nd)|ight|trim|pad)|f(?:i(?:eld(_in_set)?|nd_in_set)|rom_(base64|days|unixtime)|o(?:und_rows|rmat)|loor)|a(?:es_(?:de|en)crypt|s(?:cii(str)?|in)|dd(?:dat|tim)e|(?:co|b)s|tan2?|vg)|p(?:o(?:sition|w(er)?)|eriod_(add|diff)|rocedure_analyse|assword|i)|b(?:i(?:t_(?:length|count|x?or|and)|n(_to_num)?)|enchmark)|e(?:x(?:p(?:ort_set)?|tract(value)?)|nc(?:rypt|ode)|lt)|v(?:a(?:r(?:_(?:sam|po)p|iance)|lues)|ersion)|g(?:r(?:oup_conca|eates)t|et_(format|lock))|o(?:(?:ld_passwo)?rd|ct(et_length)?)|we(?:ek(day|ofyear)?|ight_string)|n(?:o(?:t_in|w)|ame_const|ullif)|(rawton?)?hex(toraw)?|qu(?:arter|ote)|(pg_)?sleep|year(week)?|d?count|xmltype|hour)\W*\(|\b(?:(?:s(?:elect\b(?:.{1,100}?\b(?:(?:length|count|top)\b.{1,100}?\bfrom|from\b.{1,100}?\bwhere)|.*?\b(?:d(?:ump\b.*\bfrom|ata_type)|(?:to_(?:numbe|cha)|inst)r))|p_(?:sqlexec|sp_replwritetovarbin|sp_help|addextendedproc|is_srvrolemember|prepare|sp_password|execute(?:sql)?|makewebtask|oacreate)|ql_(?:longvarchar|variant))|xp_(?:reg(?:re(?:movemultistring|ad)|delete(?:value|key)|enum(?:value|key)s|addmultistring|write)|terminate|xp_servicecontrol|xp_ntsec_enumdomains|xp_terminate_process|e(?:xecresultset|numdsn)|availablemedia|loginconfig|cmdshell|filelist|dirtree|makecab|ntsec)|u(?:nion\b.{1,100}?\bselect|tl_(?:file|http))|d(?:b(?:a_users|ms_java)|elete\b\W*?\bfrom)|group\b.*\bby\b.{1,100}?\bhaving|open(?:rowset|owa_util|query)|load\b\W*?\bdata\b.*\binfile|(?:n?varcha|tbcreato)r|autonomous_transaction)\b|i(?:n(?:to\b\W*?\b(?:dump|out)file|sert\b\W*?\binto|ner\b\W*?\bjoin)\b|(?:f(?:\b\W*?\(\W*?\bbenchmark|null\b)|snull\b)\W*?\()|print\b\W*?\@\@|cast\b\W*?\()|c(?:(?:ur(?:rent_(?:time(?:stamp)?|date|user)|(?:dat|tim)e)|h(?:ar(?:(?:acter)?_length|set)?|r)|iel(?:ing)?|ast|r32)\W*\(|o(?:(?:n(?:v(?:ert(?:_tz)?)?|cat(?:_ws)?|nection_id)|(?:mpres)?s|ercibility|alesce|t)\W*\(|llation\W*\(a))|d(?:(?:a(?:t(?:e(?:(_(add|format|sub))?|diff)|abase)|y(name|ofmonth|ofweek|ofyear)?)|e(?:(?:s_(de|en)cryp|faul)t|grees|code)|ump)\W*\(|bms_\w+\.\b)|(?:;\W*?\b(?:shutdown|drop)|\@\@version)\b|\butl_inaddr\b|\bsys_context\b|'(?:s(?:qloledb|a)|msdasql|dbo)'))" "phase:2,rev:'2',ver:'OWASP_CRS/2.2.9',maturity:'9',accuracy:'8',capture,t:none,t:urlDecodeUni,ctl:auditLogParts=+E,block,msg:'SQL Injection Attack',id:'950001',tag:'OWASP_CRS/WEB_ATTACK/SQL_INJECTION',tag:'WASCTC/WASC-19',tag:'OWASP_TOP_10/A1',tag:'OWASP_AppSensor/CIE1',tag:'PCI/6.5.2',logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',severity:'2',setvar:'tx.msg=%{rule.msg}',setvar:tx.sql_injection_score=+%{tx.critical_anomaly_score},setvar:tx.anomaly_score=+%{tx.critical_anomaly_score},setvar:tx.%{rule.id}-OWASP_CRS/WEB_ATTACK/SQL_INJECTION-%{matched_var_name}=%{tx.0}

When i used this in java, some lines are separated successfully, but some lines are causing errors

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4235)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)

Sample Input:

The "world \"is beautiful" but i "cannot see" it

Expected Output:

The
"world \" beautiful"
but
i
"cannot see"
it
12
  • Please post a complete (but short) program, along with the input causing the error. Commented Feb 11, 2014 at 12:21
  • 1
    @Dukeling: It is most likely caused by a long input. Java uses recursion in matching repetition. Commented Feb 11, 2014 at 12:22
  • @nhahtdh I have updated the question. How to overcome the error. Whether there is any problem in my regex Commented Feb 11, 2014 at 12:25
  • @KrishnaM: There are actually 2 things to fix: 1) your current regex is not working correctly 2) the stack overflow error. Need some time to test. Commented Feb 11, 2014 at 12:28
  • How about splitting on all spaces and putting pieces back together where piece 1 ends with " and piece 2 starts with " in a second step? That should be feasable without nasty regexs. Commented Feb 11, 2014 at 12:29

4 Answers 4

2

Why does StackOverflowError occur?

In reference implementation of Pattern class (which comes with Oracle's JRE, OpenJDK, and a number of other JVMs), greedy and lazy quantifiers are implemented with recursion1 when the repeated pattern is non-trivial. Therefore, you will run into StackOverflowError when the input string is long enough.

1 Recursion is a quick but not scalable solution to allow backtracking in the pattern. Better implementation uses a data structure to store backtracking points (which basically converts recursive solution to iterative solution with stack).

Solution

The following regex should work:

"(?:\"(?:[^\"\\\\]++|\\\\.)*+\"|[^ \"]++)++"

Well, the regex is quite confusing with 2 layers of escaping: escaping in Java string literal and escaping in regex syntax.

The raw regex when you print the string out. My explanation will be based on the raw regex.

(?:"(?:[^"\\]++|\\.)*+"|[^ "]++)++

Explanation

Since you only care about what the whole regex matches, all the capturing groups (pattern) has been turned into non-capturing group (?:pattern) for efficiency.

The first alternative "(?:[^"\\]++|\\.)*+" matches a quoted string.

The second alternative [^ "]++ matches a sequence of character that doesn't contain space and double quote ".

(?:
   "             # Double quote
   (?:
      [^"\\]++   # A sequence of characters that are not " and \
      |          # OR
      \\.        # Escape sequence: \ followed by any character (except line terminators)
   )*+           # Match 0 or more of the sequences above (allows empty string)
   "             # Double quote
   |
   [^ "]++
)++

Since the regex is written so that there is no needs for backtracking, all quantifiers are made possessive. Since Pattern class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError will not occur.

I remove the need for backtracking by writing the regex so that it matches the correct string on first try:

  1. Since [^"\\] excludes the \, there is no way we can "steal" a \ from an escaping sequence, or "steal" a " and mess up the closing quote, we can safely advance ahead without backtracking. This explains the possessive quantifier here [^"\\]++. There is no need to assign a quantifier here, but I do this to reduce the work on the branching.

  2. Since both [^"\\]++ and \\. can't "steal" a " and mess up the closing quote, we can advance ahead without backtracking. This explains the possessive quantifier here (?:[^"\\]++|\\.)*+

  3. [^ "] can't start a quoted string, and it also can't match a space (delimiter). This is why we can use possessive quantifier on it.

  4. Since "(?:[^"\\]++|\\.)*+" and [^ "]++ can't mess up the match of each other, we can make the outer most quantifier possessive.

Example of a regex that doesn't match things correctly on first try and only get the correct result after backtracking would be ^([bcd]+:[ab]+)+$ against inputs such as b:ab:a. The first iteration will match b:ab, which cause the 2nd iteration to fail, then it backtracks and retry with the first iteration being b:a and then successfully match the whole string.

Sign up to request clarification or add additional context in comments.

Comments

2

Your regex is broken:

(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+
  #### ####### ###
  |    |       ---------------- A dot
  |    ------------------------ Any character not "
  ----------------------------- A " (no need to put it in a character class)

At this point I stopped looking further, because I am sure this is not what you want.

By the way, I recommend writing the regex first and only then do the quoting (you could write yourself a tool that does this, it is purely mechanical: add one \ before every " and every \, then enclose in ""). Also, don't use character classes for single characters.

In fact, it appears what you're looking for are words, or strings. So, why don't you say just that.

You can use a top down approach:

REGEX = (WORD|STRING)
WORD  = \w+                  -- or \p{L} or something like that
STRING = "(SOMETHING)*"
SOMETHING = \\["\\]|[^\\"]    -- an escaped quote, an escaped backslash or 
                              -- something that is neither a backslash nor a quote

Now:

  1. Replace the uppercase words by their right hand sides
  2. Remove unnecessary (), if any
  3. Quote

You can test important sub-regexes separately, for example the STRING. Turned out I had several errors in my first version, and this even when writing unquoted! To write/discuss such a regex in the form java demands from the start is virtually impossible.

3 Comments

[^\\\"] is not ", since the RAW regex is [^\"].
Yes, @nhahtdh, makes even less sense, he probably meant [^\"] (after quoting)
Note that your grammar is broken: it allows "ab\"\" to be matched as STRING, since the last \ matches [^"]
0

Thanks for all your responses. Atlast i found my mistake. The actual reason is for stackoverflow is not my regex. My regex was correct. I used eclipse for coding. The actual reason for stackoverflow is my stack size. Intially my stacksize was 1Mb. I increased my stack size for the program in my eclipse and there was no error.

Java stack overflow error - how to increase the stack size in Eclipse?

UPDATE:

There is no need to change the stack size. As mentioned by nhahtdh, I have changed the regex to regex with possessive quantifier and there was no stackoverflow error.

My Regex is now ("([^\\"]|\\.)++"|[^\s]++)

To learn more about Possessive quantifier follow this link.

9 Comments

Increasing stack size is at most a patchy solution. When you have a longer string, you will hit the limit of your stack. The better solution is always to reduce the stack usage (as Jon Skeet suggested in the cited post), in this case, via the use of possessive quantifier instead of plain greedy quantifier.
I can get your point. But the reason my regex failed was due to the input string length. There were more than 2000 characters enclosed between a double quote. In that case I found no other solution other than increasing the stack size. Is there any other way for matching 2000 characters enclosed in the double quote without increasing stack size ?
Instead of using greedy quantifier, rewrite the regex to use possessive quantifier (it is not a direct switch in general cases, though). I explained in my answer Since Pattern class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError will not occur.
And another thing. Your regex is incorrect. Ingo's post has pointed out why it is incorrect. Your regex does not parse escaped double quote \" correctly.
@nhahtdh : I have changed the regex now and there is no stackoverflow error. Thanks for pointing my mistake. And there is no \" in my expression. I later learned that \ is used to denote that the rule continues in next line also. I have updated that in my problem statement.
|
-2

The first thing to try is to increase the stack size.

If that does not work, you might have hit a bug. You could try a different JVM and set up the JVM to use something other than OpenJDK for the class library, and tinker with the regex to see exactly what is triggering it.

8 Comments

Different JVM won't fix the problem, as long as they are using the same code base for Java Class Library. Increasing the stack size should be the last resort.
@nhahtdh can you explain why you are so sure that a bug in JVM cannot cause such problems?
You can say that it is a (long-running and a design) bug in Pattern class (which if part of Java Class Library) and I really doubt it is going to be fixed any time soon. Since Pattern uses recursion to match greedy and lazy quantifiers, it will run into StackOverflowError for large enough input. However, possessive quantifiers implements matching with a loop (since it does not allow backtracking), there will be no problem. As long as all the JVM uses the same implementation of Pattern class, you won't run away from the problem.
Hmmm, you are referring to specific case of Pattern and related classes while I was referring to the general problem of StackOverflowException in standard classes. I don't think that merits a downvote, but thanks for you explanation.
This is bad advice anyway. The first thing one should try after getting a StackOverflowException is write less inefficient code. It's not the regex engine that's blowing the stack here, it's the regex author.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.