BYTE.com
BYTE.com > Tangled in the Threads > 2001 > May

Character Encoding Issues

By Jon Udell

May 10, 2001

(Document Engineering: Page 2 of 3)



Next, I ran into an encoding problem when a subscription website I work on refused to admit a new subscriber.

It turned out that this was the first subscriber whose name contained a character not representable in 7-bit ASCII. The character is one that I can type in my Emacs text editor (using its insert-ascii function) as the integer 232 (hex E8), thusly: è. What you will see in your browser depends on the encoding that it's using. For many of us, that encoding will be ISO-8859-1, and you will see the character whose Unicode number is U+00E8, and whose name is LATIN SMALL LETTER E WITH GRAVE:

è (LATIN SMALL LETTER E WITH GRAVE)

But if your encoding is set to ISO-8859-2, you will instead see the character whose Unicode number is U+010D, and whose name is LATIN SMALL LETTER C WITH CARON:

č (LATIN SMALL LETTER C WITH CARON)
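The dependence on encoding can be checked directly. The following is a minimal Python sketch, not part of the original system: it takes the single byte 0xE8 and asks which Unicode character it becomes under each of the two encodings named above. The helper name `describe` is made up for this illustration.

```python
import unicodedata

# The single byte discussed above: decimal 232, hex E8.
RAW = bytes([0xE8])

def describe(raw: bytes, encoding: str) -> str:
    """Decode one byte under the given encoding and report its Unicode identity."""
    ch = raw.decode(encoding)
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for enc in ("iso-8859-1", "iso-8859-2"):
    print(enc, "->", describe(RAW, enc))
# iso-8859-1 -> U+00E8 LATIN SMALL LETTER E WITH GRAVE
# iso-8859-2 -> U+010D LATIN SMALL LETTER C WITH CARON
```

The byte never changes; only the table used to interpret it does.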

The Web form that accepted this 8-bit 0xE8 character relayed it to a backend business system that happily stored it. But that backend system also communicated the character, by way of XML-RPC, to another system. And that system -- specifically, its XML parser -- choked on the character. It did so because the parser, MSXML, defaults to UTF-8 when a document does not declare its encoding. In UTF-8, a byte 0xE8 announces the start of a three-byte sequence, so when ordinary ASCII letters follow it instead of continuation bytes, the input is malformed. UTF-8, by the way, is one of those infuriating industry acronyms that is often used but rarely spelled out, and that must also be recursively expanded. Thus, UTF-8 stands for UCS Transformation Format 8, and UCS in turn stands for Universal Multiple-Octet Coded Character Set.
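The failure is easy to reproduce outside MSXML. The sketch below uses Python's standard `xml.etree.ElementTree` (backed by expat) as a stand-in, on the assumption that it treats an undeclared document as UTF-8 much as MSXML does. The subscriber name `Ad\xe8le` and the `<name>` element are hypothetical, and the real XML-RPC payload is not shown in the article. Two remedies are sketched as well: declaring the actual encoding, or re-encoding the text as UTF-8 before sending it.

```python
import xml.etree.ElementTree as ET

# Hypothetical subscriber name containing the raw byte 0xE8 (ISO-8859-1 "Adèle").
NAME = b"Ad\xe8le"

def parses(doc: bytes) -> bool:
    """Return True if the XML parser accepts the document, False otherwise."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# 1. No encoding declared: the parser assumes UTF-8 and rejects the byte.
undeclared = b"<name>" + NAME + b"</name>"

# 2. Same bytes, but the document now states which encoding it uses.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + undeclared

# 3. No declaration, but the text is converted to real UTF-8 first.
reencoded = b"<name>" + NAME.decode("iso-8859-1").encode("utf-8") + b"</name>"

for label, doc in (("undeclared", undeclared),
                   ("declared", declared),
                   ("reencoded", reencoded)):
    print(label, "->", "ok" if parses(doc) else "rejected")
```

Under these assumptions only the first document is rejected. Which remedy fits a given system depends on whether the sender or the receiver is easier to change.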

I found an excellent description of the properties of UTF-8 in the UTF-8 and Unicode FAQ for Unix/Linux:
