My recent work on Qwicap has concentrated on character set issues, first concerning the loading of XML/XHTML markup, and now concerning the handling of input received from browsers. I've learned a lot. For instance, unlike output sent to a client, input sent from the client in HTTP includes no provision for identifying the character set. You can, of course, add an "accept-charset" attribute to the "form" elements in your web application's pages, but there's no guarantee that the client will support the character set you specify, unless you confine yourself to ISO-8859-1. You could try inferring the client's supported character sets from the "Accept-Charset" headers of its HTTP requests, but then you would almost always be lied to, because most browsers throw in a wildcard claiming that they accept all officially registered character sets. Since there are currently 254 of those, what do you suppose the chances are that a browser sending that wildcard really supports them all? Zilch, in my tests.
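To make that concrete, here's a minimal sketch (not Qwicap code; the servlet class name and the sample header value in the comment are inventions for illustration) of what peeking at that header looks like in a plain servlet:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CharsetProbeServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // If the browser sends an Accept-Charset header at all, it typically
        // looks something like "ISO-8859-1,utf-8;q=0.7,*;q=0.7".
        String acceptCharset = request.getHeader("Accept-Charset");

        response.setContentType("text/plain; charset=UTF-8");
        if (acceptCharset == null) {
            response.getWriter().println("No Accept-Charset header was sent.");
        } else if (acceptCharset.contains("*")) {
            // The wildcard claims support for every registered character set,
            // which no real browser actually provides.
            response.getWriter().println("Wildcard present: " + acceptCharset);
        } else {
            response.getWriter().println("Accept-Charset: " + acceptCharset);
        }
    }
}
```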
The icing on this particular cake is that a Java application server is required to provide your web application with the client's input in the form of Java String objects, via the ServletRequest interface. How can the application server reliably translate the input bytes it receives into the Unicode characters that will populate those String objects? Not knowing the character set used to encode the characters represented by those bytes, it can't, but the API requires that it blunder ahead with String creation regardless.
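The usual escape hatch is to tell the container what the encoding is before it creates any of those strings, which you can only do honestly if you controlled the encoding of the form page yourself. Here's a rough sketch of that approach in a plain servlet; the class and form-field names are invented for the example, and none of this is Qwicap's actual code:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class Utf8FormServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // Tell the container how to decode the request body. This is an
        // assertion, not a discovery: it only works because we served the
        // form page as UTF-8, and it must happen before the first call to
        // getParameter(), or the container will already have decoded the
        // bytes using its default (typically ISO-8859-1).
        request.setCharacterEncoding("UTF-8");

        // "comment" is just a hypothetical form field name.
        String comment = request.getParameter("comment");

        response.setContentType("text/plain; charset=UTF-8");
        response.getWriter().println("Received: " + comment);
    }
}
```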
Twenty 42s
All of these character set torments led me to take a close look at the first 65,536 characters of Unicode, you know, for fun. Did you know that among those characters there are twenty ways to represent the number forty-two (to pick a number arbitrarily)? There are. And that's assuming you don't mix-and-match amongst languages; if you do, there are 400 representations, because each of the two digits can be drawn from any of the twenty groups. Specifically, in those first 65,536 Unicode characters, there are twenty distinct groups of characters for representing decimal digits. Your browser probably doesn't even have a font containing glyphs for all of them, but here are those twenty forty-twos, along with their character codes expressed in hexadecimal:
42 | 0xFF14 and 0xFF12 |
᠔᠒ | 0x1814 and 0x1812 |
៤២ | 0x17E4 and 0x17E2 |
፬፪ | 0x136C and 0x136A |
၄၂ | 0x1044 and 0x1042 |
༤༢ | 0x0F24 and 0x0F22 |
໔໒ | 0x0ED4 and 0x0ED2 |
๔๒ | 0x0E54 and 0x0E52 |
൪൨ | 0x0D6A and 0x0D68 |
೪೨ | 0x0CEA and 0x0CE8 |
౪౨ | 0x0C6A and 0x0C68 |
௪௨ | 0x0BEA and 0x0BE8 |
୪୨ | 0x0B6A and 0x0B68 |
૪૨ | 0x0AEA and 0x0AE8 |
੪੨ | 0x0A6A and 0x0A68 |
৪২ | 0x09EA and 0x09E8 |
४२ | 0x096A and 0x0968 |
۴۲ | 0x06F4 and 0x06F2 |
٤٢ | 0x0664 and 0x0662 |
42 | 0x0034 and 0x0032 |
By the way, the Java methods for converting from strings to binary numeric values, like Integer.parseInt, accept all of those character sequences as valid inputs, as well as all of the cross-language mixes. Worth thinking about if you have a web application that accepts numeric inputs. And the fun doesn't stop with decimal digits; for instance, you can find a replication of most of ASCII between 0xFF00 and 0xFF60.
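If you'd like to see that behavior for yourself, this little stand-alone program demonstrates it; the ASCII-only regex check at the end is just one possible defense I'm suggesting, not something any particular framework does for you:

```java
public class DigitDemo {
    public static void main(String[] args) {
        // Arabic-Indic four and two (U+0664, U+0662) parse as 42.
        System.out.println(Integer.parseInt("\u0664\u0662"));

        // A cross-language mix also works: Devanagari four (U+096A)
        // followed by a fullwidth two (U+FF12) parses as 42, too.
        System.out.println(Integer.parseInt("\u096A\uFF12"));

        // If you only ever want the ASCII digits, say so explicitly before
        // parsing; parseInt alone won't enforce that for you.
        String input = "\u0664\u0662";
        if (input.matches("[0-9]+")) {
            System.out.println(Integer.parseInt(input));
        } else {
            System.out.println("Rejected: not ASCII digits");
        }
    }
}
```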
What's the point? Nothing in particular, except to reinforce the general warning that one is ignorant of character encoding issues at one's peril. (Before you ask, no, neither my software's handling, nor my understanding, of these issues is perfect; that's why I know a little something about the "peril" part.)