Tue, September 24, 2002

The joys of Unicode, UTF-8, and form internationalization

I’ve been working on a web app that uses UTF-8 encoding and have been surprised at how little information is available about how to do internationalization that works with all browsers. (A.J. Flavell’s “FORM submission and i18n” article and related charset issues site were quite helpful.) Consider this my small contribution.

Here’s the scenario for my app: users enter a search string and the app searches a database for matching entries. The search form includes a few other controls, so there are a number of variations and potentially 20 named fields. If possible, I only want to show the relevant name/value pairs in the URL. If I submit the form with the GET method, every field appears in the URL, even the ones whose values are empty, which is fairly ugly. I could use POST, but then the URL can’t be sent to someone else to reproduce the same result. The search is also an idempotent transaction (it just retrieves data and has no side effects), so I’d prefer to use GET.

When I submit the form, the search field is encoded as UTF-8, as defined in RFC 2279 (which obsoletes RFC 2044), and then URL-escaped. That means each byte of a non-US-ASCII character is converted into a %nn format, where each n is a hexadecimal digit. For example, α (U+03B1) is converted into %CE%B1. Since I want control of the resulting URL, I thought I’d use JavaScript’s location.href to set the URL explicitly. I then ran into the problem of how to properly URL-encode the strings. I’d used the JavaScript escape() function in the past to fix up ASCII characters that are not URL-safe, but escape() does not handle Unicode characters well. In IE, Unicode characters are supported, but the function generates a %unnnn format that is not well understood by servers. It would give %u03B1 for the previous example. What to do?
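To see the two formats side by side, here is a rough sketch that works through the UTF-8 arithmetic for α by hand. The helper name utf8TwoByteEscape is made up for illustration, and it only covers the two-byte range (U+0080 through U+07FF), so treat it as a worked example rather than a usable encoder.

```javascript
// Illustrative helper (hypothetical name): percent-encode one character
// in the two-byte UTF-8 range, U+0080 through U+07FF.
function utf8TwoByteEscape(ch) {
  var cp = ch.charCodeAt(0);       // for α this is 0x03B1
  var b1 = 0xC0 | (cp >> 6);       // leading byte, bit pattern 110xxxxx
  var b2 = 0x80 | (cp & 0x3F);     // continuation byte, bit pattern 10xxxxxx
  return "%" + b1.toString(16).toUpperCase() +
         "%" + b2.toString(16).toUpperCase();
}

// The form servers expect: the UTF-8 bytes, each percent-escaped.
var utf8Form = utf8TwoByteEscape("\u03B1");   // "%CE%B1"

// The form escape() produces for the same character: the raw 16-bit
// code unit, not its UTF-8 bytes.
var escapeForm = escape("\u03B1");            // "%u03B1"
```

For α, the code point 0x03B1 splits into the top five bits (0x0E) and the low six bits (0x31), which gives the bytes 0xCE and 0xB1.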

I found the encodeURI() and encodeURIComponent() functions, which are new in IE5.5, Netscape 6+, and Mozilla. Thankfully, they do exactly what I want. Now I just need to figure out what to do with older browsers such as IE5 and Netscape 4 (forgetting them is not yet an option). I wonder if anyone has written JavaScript code that does this encoding. I suppose I could submit the form and just live with the long URL.
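One possible approach for the older browsers is sketched below: a hand-written UTF-8 percent-encoder used only when encodeURIComponent() is missing, plus a small helper that builds the query string from the non-empty fields. The names utf8Encode, encodeParam, and buildQuery, and the field names in the example, are my own inventions for illustration. I have not tried this in IE5 or Netscape 4, so consider it a starting point rather than a tested solution.

```javascript
// Hypothetical fallback: percent-encode a string as UTF-8, leaving alone
// the same unreserved characters that encodeURIComponent() leaves alone.
function utf8Encode(str) {
  var safe = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" +
             "0123456789-_.!~*'()";
  var hex = "0123456789ABCDEF";
  var out = "";

  // Format one byte as %XX.
  function pct(b) {
    return "%" + hex.charAt(b >> 4) + hex.charAt(b & 15);
  }

  for (var i = 0; i < str.length; i++) {
    var ch = str.charAt(i);
    if (safe.indexOf(ch) != -1) { out += ch; continue; }

    var cp = str.charCodeAt(i);
    // Combine a surrogate pair into a single code point above U+FFFF.
    // (A lone surrogate falls through and is encoded as three bytes,
    // whereas encodeURIComponent() would throw a URIError.)
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < str.length) {
      var low = str.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        i++;
      }
    }

    if (cp < 0x80) {
      out += pct(cp);
    } else if (cp < 0x800) {
      out += pct(0xC0 | (cp >> 6)) + pct(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += pct(0xE0 | (cp >> 12)) + pct(0x80 | ((cp >> 6) & 0x3F)) +
             pct(0x80 | (cp & 0x3F));
    } else {
      out += pct(0xF0 | (cp >> 18)) + pct(0x80 | ((cp >> 12) & 0x3F)) +
             pct(0x80 | ((cp >> 6) & 0x3F)) + pct(0x80 | (cp & 0x3F));
    }
  }
  return out;
}

// Use the built-in function where it exists, the fallback elsewhere.
var encodeParam = (typeof encodeURIComponent == "function")
  ? encodeURIComponent
  : utf8Encode;

// Build a query string from name/value pairs, skipping empty values.
function buildQuery(fields) {
  var pairs = [];
  for (var name in fields) {
    var value = fields[name];
    if (value == null || String(value) == "") continue;
    pairs.push(encodeParam(name) + "=" + encodeParam(String(value)));
  }
  return pairs.join("&");
}

// Example with made-up field names; the empty "lang" field is dropped.
var query = buildQuery({ q: "\u03B1", lang: "", sort: "date" });
// query is "q=%CE%B1&sort=date"
```

The result could then be assigned to location.href after appending it to the search page's URL and a "?", which keeps the address short while still using GET semantics.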

It just occurred to me that all my Mozilla and IE5.5 bookmarklets should probably be converted to use encodeURIComponent() instead of escape(). That would allow searching for non-ASCII characters.
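As a rough illustration of that conversion, a search bookmarklet might look something like the sketch below. The address http://www.example.com/search?q= is a placeholder, not a real search service, and searchUrlFor is a name I made up for the example.

```javascript
// Hypothetical search endpoint, used only for illustration.
var SEARCH_URL = "http://www.example.com/search?q=";

// Build the target URL, encoding the term as percent-escaped UTF-8.
function searchUrlFor(term) {
  return SEARCH_URL + encodeURIComponent(term);
}

// Collapsed into bookmarklet form, the same logic fits on one line:
// javascript:void(location.href='http://www.example.com/search?q='+encodeURIComponent(prompt('Search for:','')))
```

With escape() in place of encodeURIComponent(), a term such as α would be sent as %u03B1 instead of %CE%B1.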