Wednesday, September 12, 2007

That's Java internationalization? Really?

In an effort to knock off one last major item from Qwicap's "to do" list before version 1.4 is released, I've begun trying to internationalize it. As I currently conceive the problem, this mostly involves removing from the code the error messages that Qwicap automatically adds to a web application's XHTML pages. For example, there's the message that the various numeric input retrieval methods (Qwicap.getInt, Qwicap.getDouble, etc.) add to pages when input is outside of the application-defined valid range of values: "The number must be in the range 0 to 100 (inclusive). The number '900' is not in that range." Complicating matters a bit is a standard feature of Qwicap that automatically replaces phrases like "The number" in such messages with the label of the relevant form control, as found in the web page, if there is such a label in the page.

I knew that there were internationalization features somewhere in Java (I'm using Java 1.5, by the way), but that's almost all that I knew. (I haven't made a serious effort to internationalize anything since my Macintosh programming days.) So I worked my way through my colleagues until I found one who knew more than I did on this subject, and, after some follow-on Google searches and related reading, I learned that the basic facility in Java for abstracting language-specific text from one's code is the ResourceBundle, and, typically, the PropertyResourceBundle subclass, which loads language/nation-specific text from a hierarchy of "properties" files.

So I now understood that, for a start, I needed to set up a properly named and formatted "properties" file for PropertyResourceBundle to discover and load. And the documentation for the PropertyResourceBundle class promptly referred me to the Properties class for information about character encoding issues. The main issue, it turns out, is that the Properties class only supports one character set, ISO-8859-1. Therefore, if you need to represent any character outside that set (which includes every non-Latin character), a cumbersome Unicode escape sequence must be used for each and every one of them. I find an internationalization "feature" designed without direct support for non-Latin characters tough to take seriously, and in a language like Java, which uses Unicode natively, such a design beggars belief. Of course, we all have things in our past that we wish we could go back and do differently. Maybe this is one such thing for the Java platform.
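To make the escape-sequence burden concrete, here is a minimal sketch of what a non-Latin label has to look like inside a conventional "properties" file, and of Properties.load decoding it again. The class and method names (EscapedPropertiesDemo, escapeNonAscii, roundTrip) and the "label" key are invented for illustration, the sketch assumes a newer JDK than 1.5 (it uses StandardCharsets), and it does not escape backslashes or other characters that have special meaning in a properties file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class EscapedPropertiesDemo {

    // Rewrites every character outside printable ASCII as a backslash-u
    // escape with four hex digits, the form a classic properties file needs.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x20 || c > 0x7E) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Writes one "key=value" line as ISO-8859-1 bytes and reads it back with
    // Properties.load(InputStream), which decodes the escapes.
    static String roundTrip(String key, String value) {
        String line = key + "=" + escapeNonAscii(value) + "\n";
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(line.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The Greek word for "number", spelled with escapes so that this
        // source file itself stays pure ASCII.
        String greek = "\u0391\u03c1\u03b9\u03b8\u03bc\u03cc\u03c2";
        System.out.println("label=" + escapeNonAscii(greek)); // what the file must contain
        System.out.println(greek.equals(roundTrip("label", greek)));
    }
}
```

A seven-letter word becomes forty-two characters of hex, none of which a translator could read or edit directly.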

Fortunately, Java 1.5 added to the Properties class a method for loading "properties" from XML files, and XML can be represented in any character set, since the XML declaration (for example: <?xml version="1.0" encoding="UTF-8"?>) can tell an XML parser how its characters were encoded. It appeared that the problem was solved.
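For reference, a minimal sketch of that XML route, assuming a small in-memory document rather than a real file. The class name XmlPropertiesDemo, the "label" key, and the sample text are placeholders; Properties.loadFromXML and the DOCTYPE line are the parts taken from the Properties documentation. The sketch also assumes a newer JDK than 1.5 for StandardCharsets and UncheckedIOException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class XmlPropertiesDemo {

    // A tiny XML "properties" document. The Greek text is written with
    // escapes only to keep this source file ASCII; in a real UTF-8 file the
    // characters would appear literally.
    static String sampleXml() {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
             + "<properties>\n"
             + "  <entry key=\"label\">\u0391\u03c1\u03b9\u03b8\u03bc\u03cc\u03c2</entry>\n"
             + "</properties>\n";
    }

    // Parses the document with Properties.loadFromXML, which honours the
    // encoding named in the XML declaration.
    static Properties load(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            // InvalidPropertiesFormatException is a subclass of IOException.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(load(sampleXml()).getProperty("label"));
    }
}
```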

Of course, the problem wasn't really solved, because it turns out that the PropertyResourceBundle class does not support XML "properties" files. The solution to that problem seemed to be creating a subclass of ResourceBundle that does support them. And creating that subclass looked straightforward at first glance: just create appropriate implementations of the abstract methods handleGetObject and getKeys, then declare victory. Unfortunately, doing so is nearly useless. The static ResourceBundle.getBundle methods implement the hierarchical search for language- and/or nation-specific resource bundles, and then instantiate the ResourceBundle subclasses needed to represent the hierarchy of potentially applicable bundles, but their choice of subclasses is hard-coded into them. So they can instantiate the built-in ListResourceBundle and PropertyResourceBundle classes, and nothing else.
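The "straightforward at first glance" subclass looks roughly like the sketch below. It is not Qwicap's code: the class name XmlResourceBundle, the "range.error" key, and the decision to back the bundle with a Properties object are all assumptions made for illustration, and it relies on Properties.stringPropertyNames, which arrived after Java 1.5. It works when instantiated directly, which is precisely the step the static getBundle methods will not perform for it.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.Properties;
import java.util.ResourceBundle;
import java.util.Set;
import java.util.TreeSet;

final class XmlResourceBundle extends ResourceBundle {

    // The key/value pairs, e.g. as loaded by Properties.loadFromXML.
    private final Properties props;

    XmlResourceBundle(Properties props) {
        this.props = props;
    }

    // Returning null tells ResourceBundle to fall back to the parent bundle.
    @Override
    protected Object handleGetObject(String key) {
        return props.getProperty(key);
    }

    // Reports this bundle's keys plus those of the parent, if one is set.
    @Override
    public Enumeration<String> getKeys() {
        Set<String> keys = new TreeSet<>(props.stringPropertyNames());
        if (parent != null) {
            keys.addAll(Collections.list(parent.getKeys()));
        }
        return Collections.enumeration(keys);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("range.error", "The number must be in the range 0 to 100 (inclusive).");
        ResourceBundle bundle = new XmlResourceBundle(props);
        System.out.println(bundle.getString("range.error"));
    }
}
```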

Having come that far, I couldn't admit defeat, however, so I took the time to completely re-implement the ResourceBundle.getBundle(String, Locale, ClassLoader) method in my subclass. I thought that that would finally do the trick, but I was wrong again, because I'd forgotten that static methods can't be overridden; they can only be hidden. Which meant that the lesser implementations of ResourceBundle.getBundle (getBundle(String) and getBundle(String, Locale)) were still invoking the original implementation of getBundle(String, Locale, ClassLoader), rather than mine. That left me feeling dumb, but creating my own implementations of those lesser getBundle methods would be a piece of cake, and, with all of the original implementations hidden, I would finally have a subclass of ResourceBundle that looked and acted just like a normal ResourceBundle, but which supported XML "properties" files. So that's what I did (mistaking the light that was drawing ever nearer for the daylight at the end of the tunnel).
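The hiding-versus-overriding trap can be reproduced in a few lines. The sketch below uses made-up names (Base, Derived, primary, convenience) standing in for ResourceBundle, my subclass, the three-argument getBundle, and the convenience getBundle methods respectively; it shows only the language rule, not the real getBundle logic.

```java
final class StaticHidingDemo {

    static class Base {
        // Stands in for getBundle(String, Locale, ClassLoader).
        static String primary() {
            return "Base.primary";
        }

        // Stands in for the convenience getBundle methods. The call to
        // primary() is bound to Base.primary at compile time.
        static String convenience() {
            return primary();
        }
    }

    static class Derived extends Base {
        // Hides Base.primary; static methods are never overridden.
        static String primary() {
            return "Derived.primary";
        }
    }

    public static void main(String[] args) {
        System.out.println(Derived.primary());     // prints "Derived.primary"
        System.out.println(Derived.convenience()); // prints "Base.primary"
    }
}
```

Even when it is invoked through the subclass, the inherited convenience method still reaches the original primary method.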

At this point, it should go without saying that that didn't work, which it didn't. My IDE almost immediately pointed out something that I hadn't noticed in the API documentation: the two lesser getBundle methods are marked final, and therefore can't even be hidden by the methods of a subclass. For some reason the primary getBundle method isn't final, but the two little convenience methods that front-end for it are final. Like so many other aspects of my day's dalliance with internationalization, that seems utterly pointless to me, but there it is.

The only good news at that stage was that hiding those two methods didn't really matter, since I was only going to use my subclass of ResourceBundle internally, and I know to use my implementation of the getBundle method when instantiating it. In fact, having already re-written all of the hard parts of the ResourceBundle class well enough for my purposes, I didn't even need to subclass ResourceBundle anymore; I could pretty much just remove the "extends ResourceBundle" phrase from my class' declaration and be done with it.

On the other hand, the main reasons for not implementing my own resource scheme from scratch at the beginning were (1) the belief that by using the familiar mechanism represented by the ResourceBundle class, other developers would have an easier time understanding my code, if the need ever arose, and (2) the hope that somewhere under all that rubbish lay a core of undiscovered, but wonderful, internationalization functionality, from which my code would benefit. There's some sense to the former concern, but the latter appears to be utterly groundless; if there is anything wonderful below, I haven't found it, and I don't see anywhere left for it to hide.

The niftiest feature associated with using "properties" files for internationalization that I'm aware of, is that the internationalized text can contain MessageFormat patterns. However, that feature is orthogonal to the Properties and ResourceBundle classes, which make no use of MessageFormat, and therefore leave it as an exercise to the developer to make his/her MessageFormat patterns do anything.
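As a sketch of what that exercise amounts to, here is the range message from the start of this post expressed as a MessageFormat pattern and filled in by hand. The class name RangeMessageDemo and the argument order are invented for illustration; in practice the pattern string would come out of a resource bundle rather than a constant.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class RangeMessageDemo {

    // {0} = control label, {1} = minimum, {2} = maximum, {3} = rejected input.
    // A doubled apostrophe is how MessageFormat spells a literal apostrophe.
    static final String PATTERN =
        "{0} must be in the range {1} to {2} (inclusive). "
      + "The number ''{3}'' is not in that range.";

    // Applies the pattern; neither ResourceBundle nor Properties does this step.
    static String format(String label, int min, int max, String input) {
        MessageFormat mf = new MessageFormat(PATTERN, Locale.ENGLISH);
        return mf.format(new Object[] { label, min, max, input });
    }

    public static void main(String[] args) {
        System.out.println(format("The number", 0, 100, "900"));
    }
}
```

Note that MessageFormat formats numeric arguments according to the locale, so limits of 1000 or more would pick up grouping separators unless the pattern says otherwise.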

By the way, the documentation for the Properties class says of its loadFromXML method and the XML files that it loads: "the system URI (http://java.sun.com/dtd/properties.dtd) is not accessed when exporting or importing properties; it merely serves as a string to uniquely identify the DTD". This turns out to be somewhere between "misleading" and "wrong"; either that DTD really is accessed, or a local copy of it is used instead, because the rules in that DTD are enforced when loadFromXML reads XML. Which makes the XML support in the Properties class useless for my purpose, because the strings that Qwicap needs to make internationalizable frequently contain elements of XHTML markup. Those XHTML elements are well-formed (in the simple sense that any start tags are matched by corresponding end tags), so they would pose no problems for an XML parser that wasn't trying to enforce the rules in a DTD, but, sadly, that is not the case here.

Qwicap's XML engine made implementing my own support for XML "properties" files a trivial matter, but the end result of doing so was that, of the three Java classes that appear to be the cornerstones of Java internationalization (ResourceBundle, Properties, and MessageFormat), I had to re-implement two of them to get an internationalization capability that was useful to my application.

Maybe my requirements are unusual, but they boil down to nothing more than supporting encodings other than the obviously limited ISO-8859-1 for my application's internationalized text, and needing to include XML (specifically XHTML) elements within some of that text. Neither requirement strikes me as remarkable individually, or in combination.

I'm new to Java internationalization, so it's easy to believe I'm missing something. Please set me straight if I am. If I'm not, put me down as lightly stunned by the substantially roll-your-own character of the Java platform's internationalization "solution".

7 comments:

  1. Anonymous, 5:02 AM CDT

    If it's just the encoding that bothers you, why don't you just write a tool that converts the XML property file (or your own format) to a regular property file?


    - Thorwin

  2. All decent i18n frameworks I've seen were either made as a part of the application or as a part of the framework, such as in Seam. The standard i18n means are pretty wretched which is sad.

  3. Hi,

    I've written and released a better i18n library for Java (than the JDK I think :-), perhaps it helps you solve some of your problems.

    http://messages.reposita.org/

    Peace
    -stephan

    --
    Stephan Schmidt :: stephan@reposita.org
    Reposita Open Source - Monitor your software development
    http://www.reposita.org
    Blog at http://stephan.reposita.org - No signal. No noise.

  4. With regard to Thorwin's question about writing a tool to convert an XML properties file to a conventional properties file, as a workaround for ResourceBundle's lack of support for XML properties files and for the lack of support for any character set other than ISO-8859-1 in conventional properties files, there are two reasons why I don't regard that as a good solution:

    (1) Having to discover and then use such a custom conversion/deconversion tool constitutes an additional complication that no prospective internationalizer is going to be expecting, so it complicates and obscures the process of internationalization, rather than aiding it.

    (2) Such a tool would reduce the internationalization files of all languages utilizing non-Latin characters to a non-human-readable gobbledygook of Unicode escape sequences which would be essentially impossible to directly read, write or modify. Since the goal of internationalization is improving communication, reducing to gibberish the files that are supposed to enable it would be, in my judgement, counterproductive, to put it mildly.

    By the way, there are links in the Javadoc for the Properties class that point to implementations of a tool that appears to do what you're suggesting. It's called "native2ascii" and appears to be a part of the normal Java distribution. (Anyway, it's present on my Mac.) This undercuts my first concern, but only partially, as neither of us seems to have noticed its existence until now, and we aren't likely to be alone in that ignorance. My second, more fundamental objection remains untouched, and, to my mind, is reason enough to reject the use of such a tool and the encoded properties files it would produce.

    (As an aside, the "man" page doesn't make it clear to me how one would use native2ascii, because its "-encoding" parameter specifies only the character set to convert to, or to convert from (it's not clear which), and it would be necessary to specify both the encoding of the input file and of the desired output file in order to perform the intended conversion correctly. If the tool actually lives up to its name and therefore only converts to/from ASCII, then it can only directly represent about half the characters in the ISO-8859-1 character set, which creates additional problems. On the other hand, if the author of the tool didn't understand the difference between ASCII and ISO-8859-1 when developing the tool, I'd be very hesitant to trust it. And the only reason we're even talking about such a tool is because whoever originally implemented the Properties class, it seems to me, failed to grasp the fact that bytes are not characters. I was confused on that point for many years, so I'm sympathetic, but basically, everything we've discussed amounts to kludging around that one mistake. Alternately, if they weren't confused, and the use of ISO-8859-1 was a considered design decision, then everything we've discussed amounts to kludging around that one decision. Either way, it's a bad situation, in my opinion.)

  5. Not sure if you're still interested in this, but I wanted to comment briefly on the native2ascii tool.

    The way to use it is as part of the build process. You'd keep all the localized properties files in UTF-8 (which makes it easy to edit them in a Unicode-aware editor), and then run native2ascii as part of the build (e.g., using an Ant task). That way you don't need to mess with the encoded properties files.

    While that's not as simple as if Java could handle UTF-8 properties files directly, it's not hard to get used to it once the build process takes care of it.

    The tool may not be widely talked about (maybe because desktop Java, and the L10N and I18N that go with doing it properly, never took off), but if you dig into the Java I18N tutorial from Sun, it's all covered.

    You're right to assume that all the I18N stuff wasn't baked into Java from the beginning - the flawed Properties class predates that, and makes some tasks that should be easy harder than necessary.

    Just my 2 cents.

  6. Just after I posted this message http://forum.java.sun.com/thread.jspa?threadID=5303828&tstart=0 , I found your blog entry. You summarized the problem very well and I totally agree with the points you mentioned.

  7. Very dense and interesting article. If you need a proper tool for java localization, you should keep in mind http://poeditor.com. It organizes files and translation strings in a very useful way for translators and has many facilities.
