Tuesday, August 26, 2014

Finding De Bruijn sequence to calculate Math.clz32.

As specified in the latest Javascript ES6 draft, Math.clz32 (formerly Number.prototype.clz) calculates the number of leading zeros to an 32-bit integer, and returns 32 for 0.
The simple implementation for this is therefore:
function clz_simple(x) {
  for (var i = 0; i < 33; i++) {
    if (x & 0x80000000) return i;
    x <<= 1;
  }
  return 32;
}
In the worst case, the loop is executed 33 times. A smarter implementation would use binary search.

function clz_binary(x) {
  if (x == 0) return 32;
  var result = 0;
  if ((x & 0xFFFF0000) === 0) { x <<= 16; result += 16; };
  if ((x & 0xFF000000) === 0) { x <<=  8; result +=  8; };
  if ((x & 0xF0000000) === 0) { x <<=  4; result +=  4; };
  if ((x & 0xC0000000) === 0) { x <<=  2; result +=  2; };
  if ((x & 0x80000000) === 0) { x <<=  1; result +=  1; };
  return result;
}
However, let's look at the following function.
function canonicalize(x) {
  x |= x >> 16;
  x |= x >> 8;
  x |= x >> 4;
  x |= x >> 2;
  x |= x >> 1;
  return x;
}
What this function does is to map all numbers with the same clz-value to the largest one. The co-domain is 2n;-1 for n in {0 ... 32}. For example, all numbers from 0x00100000 to 0x001fffff all result in 0x001fffff. At this point, we can do a simple table lookup mapping {1, 3, 7, ..., 232-1, 0} to {0, 1, ..., 32}.
var lookup_table = new Array();
for (var i = 0; i < 33; i++) {
  lookup_table[(Math.pow(2, i) - 1) | 0] = 32 - i;
}
function clz_lookup(x) {
  return lookup_table[canonicalize(x)];
}
The issue with this is that the lookup_table is a sparse array. Depending on the Javascript VM, this table lookup may be very slow, since the VM uses a hash table in order to save space. On V8, clz_lookup takes 2.5 times as long as clz_binary for me. Not ideal. Instead, we could do the hashing ourselves and strive for an ideal hash, mapping each key to exactly one slot in the hash table. Fast-forwarding the maths behind De Bruijn sequences, it essentially guarantees that there is such a hash, in the following form:
function clz_debruijn(x) {
  if (x == 0) return 32;
  return debruijn_table[((canonicalize(x) * magic) >> 27) + 16];
}
Shifting 27 bits to the right leaves 5 bits, representing 25 = 32 values, enough for {0, 1, ... 31}, so that we need special handling for x == 0. Since right shifts are sign-extending, we add 16 to move the indices from {-16, -15, ... 15} to {0, 1, ... 31}.
What's left is finding the suitable magic number. There are only so many candidates, so brute force is just fine.
var lookup_table = new Array(32);
for (var magic = 0x07c00000; magic < 0xffffffff; magic++) {
  var fail = false;
  for (var j = 0; j < 32; j++) lookup_table[j] = null;
  for (var j = 1; j < 33; j++) {
    var x = Math.pow(2, j) - 1;
    lookup_table[((x * magic) >> 27) + 16] = clz_simple(x);
  }
  for (var j = 0; j < 32; j++) fail |= (lookup_table[j] == null);
  if (!fail) print("0x" + magic.toString(16), JSON.stringify(lookup_table));
}
Among the first entries is
0x7c4acdd [23,19,11,3,16,14,7,24,12,4,8,25,5,26,27,0,31,22,30,21,18,10,29,2,20,17,15,13,9,6,28,1]
So the resulting code is:
var debruijn_table = [23,19,11,3,16,14,7,24,12,4,8,25,5,26,27,0,
                      31,22,30,21,18,10,29,2,20,17,15,13,9,6,28,1];
function clz_debruijn(x) {
  if (x == 0) return 32;
  x |= x >> 16;
  x |= x >> 8;
  x |= x >> 4;
  x |= x >> 2;
  x |= x >> 1;
  return debruijn_table[((x * 0x7c4acdd) >> 27) + 16];
}
Unfortunately, in V8 this is still slower than the binary search approach. For some reason, Firefox is a lot slower than V8 on all of the described implementations.
But all of this does not matter anyways. Almost all CPUs have native support for bit scan, and V8 will emit the suitable machine code if the caller function is optimized. And I assume that Firefox will do that as well, when they implement it.
On ARM CPUs, there is even a CLZ instruction that does exactly what the ES6 spec draft requires. On Intel CPUs, it's not that easy, as the BSR instruction does not do exactly the same thing. For once, its behavior is undefined for 0. The workaround may look like this:
  BSR result, input
  JNE not_zero
  MOV result, 63
not_zero:
  XOR result, 31

No comments:

Post a Comment