Tuesday, August 26, 2014

Mongolian Vowel Separator

In ECMA-262 5.1, chapter 7.2 defines the white space characters as the following code points
\u0009 \u000B \u000C \u0020 \u00A0 \uFEFF
... plus any other character in Unicode category "Zs" (space separator).
Well, that's simple, right? The Unicode database is available for download and easy to parse. Let's do this:

$ wget http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt \
  -O unicode-6.2.0.txt
$ grep Zs unicode-6.2.0.txt | wc -l
18
Clearly, Unicode 6.2.0 assigns 18 characters to category "Zs". But wait, that's Unicode 6.2.0. What about the newer version 6.3.0?
$ wget http://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt \
  -O unicode-6.3.0.txt
$ grep Zs unicode-6.3.0.txt | wc -l
17
We ended up with one character fewer! What's the difference?
$ diff -y -W80 <(grep Zs unicode-6.3.0.txt) \
  <(grep Zs unicode-6.2.0.txt)
0020;SPACE;Zs;0;WS;;;;;N;;;;;           0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak>   00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak>
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;   1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;
                                      > 180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;W
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;     2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;     2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;   2002;EN SPACE;Zs;0;WS;<compat> 0020;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;   2003;EM SPACE;Zs;0;WS;<compat> 0020;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<comp   2004;THREE-PER-EM SPACE;Zs;0;WS;<comp
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compa   2005;FOUR-PER-EM SPACE;Zs;0;WS;<compa
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat   2006;SIX-PER-EM SPACE;Zs;0;WS;<compat
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0   2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0
2008;PUNCTUATION SPACE;Zs;0;WS;<compa   2008;PUNCTUATION SPACE;Zs;0;WS;<compa
2009;THIN SPACE;Zs;0;WS;<compat> 0020   2009;THIN SPACE;Zs;0;WS;<compat> 0020
200A;HAIR SPACE;Zs;0;WS;<compat> 0020   200A;HAIR SPACE;Zs;0;WS;<compat> 0020
202F;NARROW NO-BREAK SPACE;Zs;0;CS;<n   202F;NARROW NO-BREAK SPACE;Zs;0;CS;<n
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;W   205F;MEDIUM MATHEMATICAL SPACE;Zs;0;W
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide>   3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide>
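Apparently \u180E (Mongolian Vowel Separator) is to category "Zs" as Pluto is to the sun's planets: as of Unicode 6.3.0 it no longer belongs to the club, having been reclassified to category "Cf" (format).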
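The diff only shows that U+180E dropped out of the "Zs" lines, not which category it carries in each file. A minimal sketch for reading a single code point's general category from the third semicolon-separated field of a UnicodeData.txt-format file; gc_of is a hypothetical helper name, not part of any existing tool:

```shell
# gc_of CODEPOINT FILE
# Print the general category (field 3) of CODEPOINT as recorded in FILE,
# which is assumed to be in UnicodeData.txt format.
# (gc_of is a made-up helper name for this sketch.)
gc_of() {
  awk -F';' -v cp="$1" '$1 == cp { print $3 }' "$2"
}
```

With the two files downloaded above, this should give:
$ gc_of 180E unicode-6.2.0.txt
Zs
$ gc_of 180E unicode-6.3.0.txt
Cf
Comparing the third field exactly also avoids matching "Zs" anywhere else on a line, which a plain grep Zs would do.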
It seems that Test262 simply does not reflect this change yet. That is probably also why browsers still treat \u180E as white space: dropping it would needlessly lower their Test262 scores.
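To see what the change means for the chapter 7.2 definition as a whole, the full white space set can be assembled from the six explicitly listed code points plus the "Zs" entries of a given Unicode version. This is only a sketch under the assumption that the file is in UnicodeData.txt format; es5_whitespace is a hypothetical helper name:

```shell
# es5_whitespace FILE
# List the code points that ECMA-262 5.1, chapter 7.2 counts as white space,
# taking the "Zs" members from FILE (assumed UnicodeData.txt format).
# (es5_whitespace is a made-up helper name for this sketch.)
es5_whitespace() {
  {
    # The six code points named explicitly in chapter 7.2.
    printf '%s\n' 0009 000B 000C 0020 00A0 FEFF
    # Every code point whose general category (field 3) is exactly "Zs".
    awk -F';' '$3 == "Zs" { print $1 }' "$1"
  } | sort -u   # 0020 and 00A0 appear in both lists, so drop duplicates
}
```

Given the counts above (18 and 17 "Zs" characters, two of which are already in the explicit list), piping the output through wc -l should report 22 code points for unicode-6.2.0.txt and 21 for unicode-6.3.0.txt, the one missing entry being 180E.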
