
I am using the ModSecurity Core Rule Set (https://github.com/SpiderLabs/owasp-modsecurity-crs) to sanitize user input data. I am seeing CPU spikes and delays when matching user input against the ModSecurity rule regular expressions. Overall there are 500+ regular expressions checking for different types of attacks (XSS, bad robots, generic, and SQL). For each request, I go through all parameters and check them against all 500 regular expressions, using Matcher.find. Some parameters fall into what looks like infinite looping, which I tackled using the technique from:

Cancelling a long running regex match?
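For reference, a minimal sketch of that technique (the class and method names here are mine, not from the linked answer): Matcher reads its input through CharSequence.charAt on every step, so a wrapping sequence can abort a runaway match once a deadline passes.

    // Minimal sketch of the linked cancellation technique. Matcher calls
    // charAt() constantly, so a runaway match hits the deadline check quickly.
    public final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        public static DeadlineCharSequence withTimeout(CharSequence inner, long timeoutMillis) {
            return new DeadlineCharSequence(inner, System.nanoTime() + timeoutMillis * 1_000_000L);
        }

        private DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            if (System.nanoTime() > deadlineNanos) {
                throw new RuntimeException("regex match exceeded its deadline");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // Share the same deadline with any subsequence the matcher takes.
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }
    }

Usage would be pattern.matcher(DeadlineCharSequence.withTimeout(param, 100)).find(), catching the RuntimeException. Note that calling System.nanoTime() on every character is itself overhead; checking the clock only every few thousand calls via a counter would reduce it.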

Sanitizing a single request takes around 500 ms, and CPU usage spikes while it runs. I analyzed this with VisualVM (visualvm.java.net) using my test-suite runner.

CPU profile output:

[screenshot of the VisualVM CPU profile]

How can I reduce the CPU usage and the load average?

4 Comments
  • According to the screenshot, checkPattern was called 212148825 times totalling 6100774 ms; that makes about 0.03 ms per call. I don't see a performance problem there, and definitely no proof of 500 ms per invocation. Commented Aug 31, 2013 at 22:07
  • If there are particular patterns causing longer delays, you should identify them and include them in your question. Commented Sep 17, 2013 at 17:15
  • @Holger The time taken is not the problem; my concern is only the load and CPU usage. I want to process the parameters in parallel, but when I did, the load average went above 4. I took a thread dump using jstack -l, found the most CPU-consuming threads using top -H -b -p <process id>, converted their ids to hex, and the threads consuming high CPU (50%) are in the RUNNABLE state at Matcher.find. Commented Sep 18, 2013 at 19:05
  • @bharathi: That's still the same kind of problem. It's very likely that most of the patterns are cheap but some of them are expensive; you have to identify the expensive ones to concentrate on optimizing those specific patterns. By the way, Matcher.find invokes other methods which are not shown due to the profiler's default filtering rules. Changing these rules to allow looking inside the JDK's methods can cast light on where inside the matching most of the time (and hence CPU load) is spent. Commented Sep 19, 2013 at 7:53

6 Answers


If possible, compile your regexes once and keep them around, rather than repeatedly (implicitly) compiling (especially inside a loop).
See java.util.regex - importance of Pattern.compile()? for more info.
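A hedged sketch of what that looks like; the rule strings below are placeholders for the real CRS expressions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class CompiledRules {
        // Compile every rule exactly once, at class-load time.
        private static final List<Pattern> RULES = new ArrayList<>();
        static {
            String[] ruleSources = { "<script", "union\\s+select" }; // placeholders
            for (String src : ruleSources) {
                RULES.add(Pattern.compile(src, Pattern.CASE_INSENSITIVE));
            }
        }

        // Reuse the precompiled Pattern objects on every request.
        public static boolean matchesAnyRule(String input) {
            for (Pattern p : RULES) {
                if (p.matcher(input).find()) {
                    return true;
                }
            }
            return false;
        }
    }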


1 Comment

I'm already doing this. I precompiled all the matching patterns and stored them in a pattern list.

I suggest you look at this paper: "Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort"

There are better ways to do the matching you describe. Essentially you take the 500 patterns you want to match and compile them into a single suffix tree, which can very efficiently match an input against all the rules at once.

The paper explains that this approach was described as "Boyer-Moore Approach to Exact Set Matching" by Dan Gusfield.

Boyer-Moore is a well-known algorithm for string matching. The paper describes a variation of Boyer-Moore for set matching.
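The paper's set-matching variant of Boyer-Moore is fairly involved; purely as an illustration of the single-pass idea, here is a minimal Aho-Corasick sketch (a related multi-pattern algorithm) over fixed strings. ModSecurity rules are full regexes, so an automaton like this could at best serve as a fast pre-filter on literal keywords extracted from the rules:

    import java.util.*;

    // Minimal Aho-Corasick sketch: matches an input against many fixed
    // strings in a single pass over the text.
    class MultiPatternMatcher {
        private static final class Node {
            final Map<Character, Node> next = new HashMap<>();
            Node fail;
            final List<String> hits = new ArrayList<>(); // patterns ending here
        }

        private final Node root = new Node();

        MultiPatternMatcher(List<String> patterns) {
            for (String p : patterns) {                  // 1. build the trie
                Node n = root;
                for (char c : p.toCharArray()) {
                    n = n.next.computeIfAbsent(c, k -> new Node());
                }
                n.hits.add(p);
            }
            Deque<Node> queue = new ArrayDeque<>();      // 2. wire failure links (BFS)
            for (Node child : root.next.values()) {
                child.fail = root;
                queue.add(child);
            }
            while (!queue.isEmpty()) {
                Node n = queue.poll();
                for (Map.Entry<Character, Node> e : n.next.entrySet()) {
                    Node child = e.getValue();
                    Node f = n.fail;
                    while (f != root && !f.next.containsKey(e.getKey())) {
                        f = f.fail;
                    }
                    Node target = f.next.get(e.getKey());
                    child.fail = (target != null && target != child) ? target : root;
                    child.hits.addAll(child.fail.hits);  // inherit shorter suffix matches
                    queue.add(child);
                }
            }
        }

        // Returns every pattern occurring anywhere in the text, in one pass.
        List<String> search(String text) {
            List<String> found = new ArrayList<>();
            Node n = root;
            for (char c : text.toCharArray()) {
                while (n != root && !n.next.containsKey(c)) {
                    n = n.fail;
                }
                n = n.next.getOrDefault(c, root);
                found.addAll(n.hits);
            }
            return found;
        }
    }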



I think this is the root of your problem, not the regex performance per se:

For each request, I go through all parameters and check against all these 500 regular expressions

No matter how fast your regexes are, this is still plenty of work. I don't know how many parameters you have, but even if there are only a few, that's still thousands of regex checks per request. That can kill your CPU.

Apart from the obvious things like improving your regex performance by precompiling and/or simplifying them, you can do the following things to reduce the amount of regex checking:

  1. Use positive validation of user input based on the parameter type. E.g. if some parameter must be a simple number, don't waste time checking whether it contains a malicious XML script; just check whether it matches [0-9]+ (or something similarly simple). If it does, it is ok and you can skip checking all 500 regexps.

  2. Try to find simple regexps that can eliminate whole classes of attacks by finding what your regexps have in common. E.g. if you've got 100 regexps checking for the existence of certain HTML tags, check first whether the content contains at least one HTML tag at all. If it doesn't, you immediately save checking 100 regexps.

  3. Cache results. Many parameters generated in webapps repeat themselves, so don't check the same content over and over again; just remember the final validation result. Be careful to limit the maximum size of the cache to avoid DoS attacks. A sketch combining all three ideas follows below.

Also note that negative validation is usually easy to bypass: someone changes a few characters in their malicious code and your regexps no longer match, so you have to keep growing your "database" of regexps to protect against new attacks. Positive validation (whitelisting) doesn't have this disadvantage and is much more effective.
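A minimal sketch combining the three ideas above; the pre-filter pattern, cache size, and rule grouping are all hypothetical choices:

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Pattern;

    class ParamChecker {
        private static final Pattern NUMERIC = Pattern.compile("[0-9]+");
        private static final Pattern HAS_TAG = Pattern.compile("<[A-Za-z!/?]"); // cheap guard

        private final List<Pattern> htmlRules;    // rules that only matter if a tag is present
        private final List<Pattern> generalRules; // everything else

        // Bounded LRU cache so repeated parameter values are validated once;
        // the cap limits memory use and cache-flooding attacks.
        private final Map<String, Boolean> cache =
                new LinkedHashMap<String, Boolean>(1024, 0.75f, true) {
                    @Override protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                        return size() > 10_000;
                    }
                };

        ParamChecker(List<Pattern> htmlRules, List<Pattern> generalRules) {
            this.htmlRules = htmlRules;
            this.generalRules = generalRules;
        }

        boolean isClean(String value, boolean numericExpected) {
            if (numericExpected) {                        // idea 1: positive validation
                return NUMERIC.matcher(value).matches();  // skip all 500 rules
            }
            Boolean cached = cache.get(value);            // idea 3: cache results
            if (cached != null) return cached;
            boolean clean = passesAllRules(value);
            cache.put(value, clean);
            return clean;
        }

        private boolean passesAllRules(String value) {
            if (HAS_TAG.matcher(value).find()) {          // idea 2: cheap guard regex
                for (Pattern p : htmlRules) {
                    if (p.matcher(value).find()) return false;
                }
            }
            for (Pattern p : generalRules) {
                if (p.matcher(value).find()) return false;
            }
            return true;
        }
    }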

1 Comment

Hi, I did the first two steps. The total execution time is now better than before; I gained around 60 ms.

Avoid expressions with:

  • multiline matching
  • case insensitivity
  • etc.

Perhaps you can also consider grouping your regular expressions and applying a given group of regular expressions depending on the user input, as sketched below.
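A minimal sketch of that grouping idea; the input kinds and group contents are hypothetical:

    import java.util.*;
    import java.util.regex.Pattern;

    // Keep separate rule groups and run only the group relevant to the input.
    enum InputKind { HTML_FRAGMENT, SQL_LITERAL, PLAIN_TEXT }

    class GroupedRules {
        private final Map<InputKind, List<Pattern>> groups = new EnumMap<>(InputKind.class);

        void add(InputKind kind, Pattern rule) {
            groups.computeIfAbsent(kind, k -> new ArrayList<>()).add(rule);
        }

        boolean matchesAny(InputKind kind, String input) {
            for (Pattern p : groups.getOrDefault(kind, Collections.emptyList())) {
                if (p.matcher(input).find()) return true;
            }
            return false;
        }
    }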

2 Comments

This is plain wrong. Why not multiline / case insensitive? Regex matching in Java is based on an NFA plus backtracking, so things like case sensitivity do not affect performance much. It is much more important to avoid backtracking, e.g. .* followed by an alternation (a|b|c).
As you said, case sensitivity and multiline search do not affect the performance "much", which is another way of saying that they do affect it. Depending on the performance requirements, that may or may not be relevant. If performance is required, backtracking must not be used. Never.

If you have such a big number of regexes, you could group (at least some of) them using a trie (http://en.wikipedia.org/wiki/Trie).
The idea is that if you have, for example, regexes like /abc[0-9-]/, /abde/, /another example/, /.something else/ and /.I run out of ideas/, you can combine them into the single regex

 /a(?:b(?:c[0-9-]|de)|nother example)|.(?:I run out of ideas|something else)/

In this way, the matcher has to run only once instead of five times, and you avoid a lot of backtracking, because the common starting parts are shared in the regex above.
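A hedged sketch of how a combined pattern could be used while still recovering which original rule matched (anticipating the attribution concern raised in the comments below); the names are illustrative:

    import java.util.List;
    import java.util.regex.Pattern;

    class RulePackage {
        private final Pattern combined;    // union of the package's rules
        private final List<Pattern> rules; // originals, kept for attribution

        RulePackage(List<Pattern> rules) {
            this.rules = rules;
            StringBuilder sb = new StringBuilder();
            for (Pattern p : rules) {
                if (sb.length() > 0) sb.append('|');
                sb.append("(?:").append(p.pattern()).append(')');
            }
            this.combined = Pattern.compile(sb.toString());
        }

        // Returns the first individual rule that matches, or null if none do.
        Pattern findMatchingRule(String input) {
            if (!combined.matcher(input).find()) {
                return null;               // fast path: whole package rejected in one pass
            }
            for (Pattern p : rules) {      // slow path: identify the exact rule
                if (p.matcher(input).find()) return p;
            }
            return null;
        }
    }

Note that a naive alternation like this does not share common prefixes the way the hand-built regex above does, and rules containing backreferences would break when combined because group numbers shift.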

2 Comments

Hi Davide, I can't group them, because I need to get the details of the matched rule (each ModSecurity rule has its own attributes).
In principle, if the matched rules are just some among the 500, you could prepare packages of regexes, using the above procedure to make one big regex per package. When one of the big regexes finds a match, you check the original rules that form that package. To make this method effective, you should cluster the rules that are most likely to appear together. I hope it's feasible.

There must be a subset of problematic regexes among these 500, i.e. patterns that trigger catastrophic backtracking. For example, this match

    String s = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB";

    // (A+)+ is a classic catastrophic-backtracking pattern: the trailing B
    // forces the engine to try exponentially many ways to split the A's.
    Pattern.compile("(A+)+").matcher(s).matches();

will take years to complete.

So in your case I would log all the problematic regexes together with their problematic inputs, as sketched below. Once these are found, you can manually rewrite those few regexes and test them against the originals. A regex can always be rewritten as a simpler and more readable Java function.
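A minimal sketch of such logging; the threshold is an arbitrary choice:

    import java.util.regex.Pattern;

    class TimedRuleCheck {
        private static final long SLOW_THRESHOLD_NANOS = 50_000_000L; // 50 ms, arbitrary

        // Time each rule and log the slow ones together with the offending
        // input, so the few pathological patterns can be identified.
        static boolean check(Pattern rule, String input) {
            long start = System.nanoTime();
            boolean matched = rule.matcher(input).find();
            long elapsed = System.nanoTime() - start;
            if (elapsed > SLOW_THRESHOLD_NANOS) {
                System.err.printf("SLOW RULE %s took %d ms on input: %s%n",
                        rule.pattern(), elapsed / 1_000_000, input);
            }
            return matched;
        }
    }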

Another option, though it would not resolve the problem above, is to use a faster (20x in some cases) but more limited regex library; it is available in Maven Central.

