Tuesday, July 3, 2007

Twenty 42s

My recent work on Qwicap has concentrated on character set issues, first concerning the loading of XML/XHTML markup, and now concerning receiving input from browsers. I've learned a lot. For instance, unlike output sent to a client, input sent from the client over HTTP includes no provision for identifying its character set. You can, of course, add an "accept-charset" attribute to the "form" elements in your web application's pages, but there's no guarantee that the client will support the character set you specify, unless you confine yourself to ISO-8859-1. You could try inferring the client's supported character sets from the "Accept-Charset" headers of its HTTP requests, but then you would almost always be lied to, because most browsers throw in a wildcard meaning that they accept all official character sets. Since there are currently 254 of those, what do you suppose the chances are that a browser sending that wildcard really supports them all? Zilch, in my tests.

The icing on this particular cake is that a Java application server is required to provide your web application with the client's input in the form of Java String objects, via the ServletRequest interface. How can the application server reliably translate the input bytes it receives into the Unicode characters that will populate those String objects? Not knowing the character set used to encode the characters represented by those bytes, it can't, but the API requires that it blunder ahead with String creation regardless.
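To make the problem concrete, here is a minimal sketch of what "blundering ahead" looks like: the same two input bytes decoded under two different charset assumptions. The class and method names are made up for illustration, and this uses only the standard library rather than any application server's actual decoding code.

```java
import java.nio.charset.StandardCharsets;

class CharsetGuess {
    // Decode the raw bytes on the assumption that the client sent UTF-8.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode the very same bytes on the assumption that the client sent ISO-8859-1.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};

        String utf8 = decodeAsUtf8(raw);     // one character: U+00E9
        String latin1 = decodeAsLatin1(raw); // two characters: U+00C3 U+00A9

        System.out.println(utf8.length() + " char(s) when read as UTF-8");
        System.out.println(latin1.length() + " char(s) when read as ISO-8859-1");
    }
}
```

Neither decoding fails, which is the trouble: a wrong guess yields a perfectly valid String with the wrong characters in it. In a servlet the guess is made on your behalf, although ServletRequest.setCharacterEncoding lets the application state its own assumption before any parameters are read.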

All of these character set torments led me to take a close look at the first 65,536 characters of Unicode, you know, for fun. Did you know that among those characters there are twenty ways to represent the number, just to pick one arbitrarily, forty-two? There are. And that's assuming that you don't mix-and-match amongst languages. If you do, there are 400 representations. Specifically, in those first 65,536 Unicode characters, there are twenty discrete groups of characters for representing decimal digits. Your browser probably doesn't even have a font containing glyphs for all of them, but here are those twenty forty-twos, along with their character codes expressed in hexadecimal:

４２  0xFF14 and 0xFF12
᠔᠒  0x1814 and 0x1812
៤២  0x17E4 and 0x17E2
፬፪  0x136C and 0x136A
၄၂  0x1044 and 0x1042
༤༢  0x0F24 and 0x0F22
໔໒  0x0ED4 and 0x0ED2
๔๒  0x0E54 and 0x0E52
൪൨  0x0D6A and 0x0D68
೪೨  0x0CEA and 0x0CE8
౪౨  0x0C6A and 0x0C68
௪௨  0x0BEA and 0x0BE8
୪୨  0x0B6A and 0x0B68
૪૨  0x0AEA and 0x0AE8
੪੨  0x0A6A and 0x0A68
৪২  0x09EA and 0x09E8
४२  0x096A and 0x0968
۴۲  0x06F4 and 0x06F2
٤٢  0x0664 and 0x0662
42  0x0034 and 0x0032
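If you would like to reproduce a list like this yourself, the following is a rough sketch of one way to do it: scan the first 65,536 code points for every character that Java classifies as a decimal digit with the value zero, then derive the "four" and "two" of each group from it. The class and method names are invented for this example, and it assumes that each group's digits occupy ten consecutive code points starting at its zero. The number of groups it finds depends on the Unicode version bundled with your JVM, so expect it to differ from my count of twenty.

```java
import java.util.ArrayList;
import java.util.List;

class DigitGroups {
    // Collect every code point below 0x10000 that Java treats as a decimal zero.
    static List<Integer> zeroDigits() {
        List<Integer> zeros = new ArrayList<>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER
                    && Character.digit(cp, 10) == 0) {
                zeros.add(cp);
            }
        }
        return zeros;
    }

    public static void main(String[] args) {
        List<Integer> zeros = zeroDigits();
        for (int zero : zeros) {
            // Assumption: "four" and "two" sit at fixed offsets from the group's zero.
            System.out.printf("0x%04X and 0x%04X%n", zero + 4, zero + 2);
        }
        int groups = zeros.size();
        // Each of the two digit positions can come from any group independently.
        System.out.println(groups + " groups, " + (groups * groups) + " mix-and-match forty-twos");
    }
}
```

The final line is just the arithmetic behind the figure of 400: with twenty groups, each of the two digits can be drawn from any of them, giving 20 × 20 combinations.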

By the way, the Java methods for converting from strings to binary numeric values, like Integer.parseInt, accept all of those character sequences as valid inputs, as well as all of the cross-language mixes. Worth thinking about if you have a web application that accepts numeric inputs. And the fun doesn't stop with decimal digits; for instance, you can find a replication of most of ASCII between 0xFF00 and 0xFF60.
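Here is a small sketch of that behaviour, together with one possible guard for applications that only want the ASCII digits. The isAsciiDecimal helper is my own hypothetical name, not part of any library, and whether a given non-ASCII digit is accepted depends on the Unicode tables in your JVM; the three scripts used here are ones I would expect any current Java runtime to recognise.

```java
class AsciiDigits {
    // True only for non-empty strings consisting solely of the characters '0' through '9'.
    static boolean isAsciiDecimal(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String arabicIndic = "\u0664\u0662"; // 0x0664 and 0x0662 from the table
        String mixed = "\uFF14\u0968";       // fullwidth four followed by Devanagari two

        // Integer.parseInt relies on Character.digit, which understands Unicode digits.
        System.out.println(Integer.parseInt(arabicIndic));
        System.out.println(Integer.parseInt(mixed));

        // The guard rejects both of those strings and accepts plain ASCII.
        System.out.println(isAsciiDecimal(arabicIndic));
        System.out.println(isAsciiDecimal(mixed));
        System.out.println(isAsciiDecimal("42"));
    }
}
```

Whether you want such a guard is a design decision: accepting every script's digits is arguably friendlier to users, while restricting input to ASCII keeps whatever you store or log predictable.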

What's the point? Nothing in particular, except to reinforce the general warning that one is ignorant of character encoding issues at one's peril. (Before you ask, no, neither my software's handling, nor my understanding, of these issues is perfect; that's why I know a little something about the "peril" part.)