BYTE.com > Tangled in the Threads > 2001 > May
Character Encoding Issues
By Jon Udell
May 10, 2001
(Document Engineering: Page 2 of 3)
Next, I ran into an encoding problem when a subscription website I work on refused to admit a new subscriber.
It turned out that this was the first subscriber whose name contained a character not representable in 7-bit ASCII. The character is one that I can type in my emacs text editor (using its insert-ascii function) as the integer 232 (hex E8), thusly: è. What you will see, in your browser, depends on the encoding that it's using. For many of us, that encoding will be ISO-8859-1, and you will see the character whose Unicode number is 00E8, and whose name is LATIN SMALL LETTER E WITH GRAVE: è
But if your encoding is set to ISO-8859-2, you will instead see the character whose Unicode number is 010D, and whose name is LATIN SMALL LETTER C WITH CARON: č
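The same effect can be reproduced outside a browser. The following is a minimal sketch in Python (a language chosen here for illustration, not one used in the systems described); the helper name `describe` is hypothetical. It decodes the single byte 0xE8 under each of the two encodings and reports the resulting Unicode number and name.

```python
import unicodedata

# The single byte discussed above: integer 232, hex E8.
RAW = bytes([0xE8])

def describe(raw: bytes, encoding: str) -> str:
    """Decode one byte under the given encoding and name the resulting character."""
    ch = raw.decode(encoding)
    return f"{encoding}: U+{ord(ch):04X} {unicodedata.name(ch)}"

for enc in ("iso-8859-1", "iso-8859-2"):
    print(describe(RAW, enc))

# iso-8859-1: U+00E8 LATIN SMALL LETTER E WITH GRAVE
# iso-8859-2: U+010D LATIN SMALL LETTER C WITH CARON
```

The byte never changes; only the table used to interpret it does.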
The Web form that accepted this 0xE8 character relayed it to a backend business system that happily stored it. But that backend system also communicated the character, by way of XML-RPC, to another system. And that system -- specifically, its XML parser -- choked on the character. It did so because the parser, MSXML, defaults to UTF-8, and a lone 0xE8 byte is not a valid UTF-8 sequence. UTF-8, by the way, is one of those infuriating industry acronyms that is often used but rarely spelled out, and that must also be recursively expanded. Thus, UTF-8 stands for UCS Transformation Format 8, and UCS in turn stands for Universal Multiple-Octet Coded Character Set.
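To see why a parser that assumes UTF-8 rejects this byte, here is a rough sketch in Python. It uses Python's standard `xml.etree.ElementTree` parser as a stand-in; it is not MSXML, and the `<name>` element and the helper names `is_valid_utf8` and `parse_name` are assumptions made purely for illustration. In UTF-8, 0xE8 is the lead byte of a three-byte sequence, so on its own it is malformed.

```python
import xml.etree.ElementTree as ET

RAW = bytes([0xE8])  # the byte as the Web form passed it along

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def parse_name(payload: bytes):
    """Parse a tiny XML document; return its text, or None if the parser rejects it."""
    try:
        return ET.fromstring(payload).text
    except ET.ParseError:
        return None

# One possible repair: decode with the encoding the byte was actually typed in
# (assumed here to be ISO-8859-1), then re-encode as UTF-8.
REPAIRED = RAW.decode("iso-8859-1").encode("utf-8")

print(is_valid_utf8(RAW))        # False
print(REPAIRED)                  # b'\xc3\xa8'
print(is_valid_utf8(REPAIRED))   # True

# No encoding declaration: the parser assumes UTF-8 and rejects the lone byte.
print(parse_name(b"<name>" + RAW + b"</name>"))  # None

# Declaring the real encoding, or sending real UTF-8, both parse cleanly.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>' + RAW + b"</name>"
print(parse_name(declared) == "\u00e8")                             # True
print(parse_name(b"<name>" + REPAIRED + b"</name>") == "\u00e8")    # True
```

Either remedy works in this sketch: label the document with the encoding its bytes really use, or transcode to UTF-8 before handing the document to the parser.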
I found an excellent description of the properties of UTF-8 in the UTF-8 and Unicode FAQ for Unix/Linux: