4

I'm facing a problem with Regex performance in C#.

I need to replace on a very large string (270k charachters, don't ask why..). The regex matches about 3k times.

private static Regex emptyCSSRulesetRegex = new Regex(@"[^\};\{]+\{\s*\}", RegexOptions.Compiled | RegexOptions.Singleline);

public string ReplaceEmptyCSSRulesets(string css) {
  return emptyCSSRulesetRegex.Replace(css, string.Empty);
}

The string I pass to the method looks something like this:

.selector-with-statements{border:none;}.selector-without-statements{}.etc{}

Currently the replace process takes up 1500ms in C#, but when I do exactly the same in Javascript it only takes 100ms.

The Javascript code I used for timing:

console.time('reg replace');
myLargeString.replace(/[^\};\{]+\{\s*\}/g,'');
console.timeEnd('reg replace');

I also tried to do the replacing by looping over the matches in reverse order and replace the string in a StringBuilder. That was not helping.

I'm surprised by the performance difference between C# and Javascript in this case, and I think there I'm doing something wrong but I cannot think of anything.

11
  • 1
    I think the regex engines used for JS and C# are different. You should not compare features across languages. Am not sure about C# but in Java, creation of regex (when you do matches() )itself is a time consuming process. It involves synchronization and other time consuming stuff. Commented Feb 20, 2015 at 11:15
  • Try changing the \s to [\f\n\r\t\v\u00A0\u2028\u2029] or using RegexOptions.ECMAScript. There are differences on how character classes are handled in JS and in C# Commented Feb 20, 2015 at 11:18
  • 1
    @TheLostMind I agree, but the difference is huge ... So it might be a signal that i'm doing something wrong in C#. I'm trying to find out what exactly. Commented Feb 20, 2015 at 11:19
  • Have a look at stackoverflow.com/questions/1252194/… Commented Feb 20, 2015 at 11:20
  • @stribizhev Do you think the regex is the problem? That does not correspond with the javascript regex execution time. But I made a test for only matching without replacing, that was fast in C#. Commented Feb 20, 2015 at 11:32

1 Answer 1

3

I can't really explain the difference of time between Javascript and C#(*). But you can try to improve the performance of your pattern (that produces a lot of backtracking):

private static Regex emptyCSSRulesetRegex = new Regex(@"(?<keep>[^};{]+)(?:{\s*}(?<keep>))?", RegexOptions.Compiled);

public string ReplaceEmptyCSSRulesets(string css) {
    return emptyCSSRulesetRegex.Replace(css, @"${keep}");
}

One of the problems of your original pattern is that when curly brackets are not empty (or not filled with whitespaces), the regex engine will continue to test each positions before the opening curly bracket (with always the same result). Example: with the string abcd{1234} your pattern will be tested starting on a, then b ...

The pattern I suggests will consume abcd even if it is not followed by empty curly brackets, so the positions of bcd are not tested.

abcd is captured in the group named keep but when empty curly brackets are found, the capture group is overwritten by an empty capture group.

You can have an idea of the number of steps needed for the two patterns (check the debugger):

original pattern

new pattern

Note: your original pattern can be improved if you enclose [^}{;]+ in an atomic group. This change will divide the number of steps needed by 2 (compared to the original), but even with that, the number of steps stays high for the previously explained reason.

(*) it's possible that the javascript regex engine is smart enough to not retry all these positions, but it's only an assumption.

Sign up to request clarification or add additional context in comments.

5 Comments

Thank you! Your regex seems to work. I will try that in the Javascript regex also
@WillemdeWit: You can't use the same pattern with javascript since it doesn't support named captures, but you can write something similar using a callback function as replacement: str.replace(/[^};{]+({\s*})?/g, function (m, g1) { return (g1) ? '' : m; });
Turns out that's 10x faster also.
10x faster than before? or 10x faster than the previous regex in JS?
@CasimiretHippolyte - Can you please bring the "original pattern" and "new pattern" regex in to the answer - external links are frowned upon as the external site may change and thus break the valuable answer for the SO community. You can certainly have the link in addition to showing the regex in the answer.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.