Tuesday, August 26, 2014

It's a little known fact that Firefox's Spidermonkey has switched to a new RegExp engine recently. They abandoned YARR, which they took from WebKit's JSC a few years ago, and replaced it with the high-performance Irregexp from Chrome's V8. Porting it to Spidermonkey certainly was a tremendous job. As a result, their benchmark score on Octane/regexp improved a lot. (Orange for Firefox/Spidermonkey, green for Chrome/V8)

But for some reason, it's still pretty far away from V8's score in that particular benchmark. Let's speculate on why this is the case.

Irregexp is a feature complete regexp engine with regard to the syntax defined in ECMA-262. Except for simple character matches, it generates a NFA from the regexp literal and compiles it down to machine code. It uses a bunch of nifty tricks to speed things up. For simple character matches, it uses a couple of different matching algorithms, including Boyer-Moore-Horspool.
To see where the main differences are, I wrote a few micro-benchmarks to test certain RegExp features. I take advantage of the fact that Irregexp suffers from "catastrophic backtracking" for the first two items.
function perf(f, desc) {
  var start = performance.now();
  f();
  document.write(desc + " : " + ((performance.now() - start) | 0));
}

perf(function() { /(a*)*b/.test("aaaaaaaaaaaaaaaaaaaaaaaa"); },
     "backtrack");
perf(function() { /(\d*)*a/.test("012345678901234567890123"); },
     "char class");
perf(function() {
       for (var i = 0; i < 10000000; i++) /\w/.test("a");
     },
     "simple");
perf(function() {
       for (var i = 0; i < 1000000; i++)
         "abcdefghijklmnopqrstuvwxyz".match(/(.)/g);
     }, "global match");
perf(function() {
       for (var i = 0; i < 1000000; i++)
         "abcdefghijklmnopqrstuvwxyz".replace(/(.)/g, "abc");
     }, "global replace");
perf(function() {
       for (var i = 0; i < 1000000; i++)
         "abcdefghijklmnopqrstuvwxyz".match(/X/i);
     }, "ignore case");
The result I got is this:
Chrome stable 35.0.1916.153:
backtrack :       613
char class :      614
simple :          464
global match :   1510
global replace : 1232
ignore case :     966

Firefox nightly 33.0a1:
backtrack :       640
char class :      733
simple :          873
global match :   2026
global replace : 2589
ignore case :    2359

Firefox stable 30.0:
backtrack :        69
char class :       69
simple :          410
global match :   1361
global replace : 2525
ignore case :    5653
It's pretty clear that the nightly version of Firefox is running a different regexp engine than stable. The latter does not suffer from exponential backtracking, but is also much slower at the last benchmark item.

Comparing the "backtrack" item, Firefox nightly's score seems almost on par with that of Chrome. Since the regexp is run only once, both spend almost all time in the compiled code. It seems that the difference in code quality is negligible.

The second item is almost the same, except that we are using a character class. Here the difference is a lot bigger. It seems that either the code Firefox generates to match character classes is inefficient, or a shortcut that V8 uses is not used by Spidermonkey.

The third through last item not only tests regexp code performance, but also tests entering and exiting the regexp code as well as processing the result. On those items Spidermonkey seems to really lag behind.

For global matches, V8 collects more than only one set of results at a time, therefore reducing the number of times it has to enter, exit and reenter the regexp code. This probably has not been ported yet either.

For the "ignore case" item I suspect that the performance gap is a result of both slow entry/exit and inefficient code for case insensitive matches.
But given that Spidermonkey has only just ported Irregexp, and there is still a lot of room upwards with regard to performance, I'm sure the Mozilla folks are busy improving their future benchmark scores.

No comments:

Post a Comment