is an EU member and Greek doesn't fit into those character sets. Add Turkey and the Central and Eastern European countries, and interoperating among the hundreds of possible 8-bit encodings becomes a nightmare.
Then there are the CJK (Chinese, Japanese, Korean) languages, which require character sets with many thousands of characters. Japan is now the world's second largest user of software, and countries that require Chinese and Korean character sets have double-digit growth rates.
For years, Unicode, the standard for uniform 16-bit encoding of information for the world's commonly used scripts, has offered a way around these linguistic problems. Today, the growing internationalization of business is giving Unicode a higher profile than ever among software developers. Savvy developers also have learned ways to make working with Unicode as smooth as possible.
Unicode exists primarily because alternatives for internationalizing applications aren't foolproof. For example, various multibyte (double-byte character set, or DBCS) encodings exist and use a mix of 8- and 16-bit characters. But a variable-width encoding requires special leadbyte characters or escape sequences to identify the width of each character. Multibyte encodings lack the fixed-width simplicity of ASCII, as characters cannot be manipulated until they are individually inspected to determine their size. Moving a pointer backward through a multibyte string requires specific APIs or lengthy algorithms. If an application is targeting only one Asian country, for instance, then it's not really an international product, so multibyte encoding is good enough. But multibyte's separate requirements make it unattractive for truly international products.
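To make the multibyte problem concrete, here is a minimal sketch in Python. Shift-JIS is chosen only as one illustrative DBCS encoding, and the sample string is an assumption for this example; the article does not name a specific encoding.

```python
# Sketch: why variable-width (multibyte) text has to be inspected byte by byte.
# Shift-JIS is used here only as an illustrative DBCS encoding.

text = "A\u30bd"                  # one ASCII letter plus katakana SO
data = text.encode("shift_jis")   # 'A' takes 1 byte, the katakana takes 2

char_count = len(text)            # number of characters
byte_count = len(data)            # number of bytes

# The trail byte of the katakana is 0x5C, the same value as an ASCII
# backslash, so a byte-oriented search finds a "match" inside a character.
false_hit = b"\\" in data

# Cutting the buffer between a lead byte and its trail byte breaks the text.
try:
    data[:2].decode("shift_jis")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(char_count, byte_count, false_hit, truncated_ok)
```

Character counts and byte counts disagree, a byte search can land in the middle of a character, and a cut at the wrong offset leaves an undecodable fragment. Those are the per-character inspections the paragraph above describes.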
Unicode to the Rescue
The Unicode/ISO 10646 standard provides one uniform 16-bit encoding that can store information from all the world's commonly used scripts. The key word here is "standard." Unicode itself is a standard, not a technology. Where technology gets involved is how the software makes use of the standard.
The Unicode concept of parking characters into a 64K space (a 16-bit value allows 65,536 positions) sounds simple enough -- until you realize there are three or four times that many characters in the world's written languages. So a key part of Unicode's design is to treat that 64K space as valuable real estate, since it has to support a large number of scripts in one consistent encoding.
Several parts of Unicode's design help it make the most of each codepoint, the permanent Unicode address of a character. For example, diacritic marks in most other character sets are not stored as unique characters, but in Unicode each diacritic can be separately tracked and shared among several characters. Codepoints are also conserved through Han Unification, sort of like a highway carpool lane where two or three characters with similar appearance share the same space. To Unicode, small differences in appearance should be handled as a font issue, not by inventing another character encoding. Also, Unicode does not guarantee a particular sort order, since software should handle that separately.
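As a rough illustration of codepoints, shared diacritics, and a unified Han character, the sketch below uses Python's standard `unicodedata` module. The sample characters are assumptions chosen for the example, not taken from the article.

```python
import unicodedata

# One unified Han ideograph: a single codepoint, whichever CJK language uses it.
han = "\u4e2d"
han_label = f"U+{ord(han):04X} {unicodedata.name(han)}"

# One combining mark, assigned once, reused with several base letters.
acute = "\u0301"
mark_name = unicodedata.name(acute)
accented = [base + acute for base in "aeo"]

print(han_label)
print(mark_name, [len(s) for s in accented])
```

Each accented string here is two codepoints long: the base letter followed by the shared mark. How the ideograph is drawn for a Chinese or a Japanese reader is left to the font, as the text says.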
Unicode Assumptions
To maximize its utility, the Unicode standard rests on certain specific assumptions:
Permanent assignment.
The number of assigned characters has grown to a total of 38,885 in Unicode 2.0, but by design no Unicode character has become obsolete. The Unicode Consortium is allocating additional space carefully, so existing assignments can be permanently relied on.
Fixed-width, 16-bit encoding.
Much like ASCII, Unicode characters are always the same size. Nulls are 16-bit.
No escape sequences.
Since Unicode is fixed-width, there is no need for leadbyte or other noncharacter ranges.
Diacritics and base characters.
Any diacritic or accent mark can combine with any base character at run time, which saves encoding space. For compatibility, there are also some equivalent assignments of precomposed character combinations.
Plain text.
Unicode codepoints have no inherent meaning; they represent plain text independent of language.
Logical order.
Unicode is stored and retrieved in logical order, which is not necessarily the same as visual order.
Private use area.
Instead of cloning new character sets for custom requirements, Unicode has a preassigned area where you can add special end-user characters.
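A few of the assumptions listed above can be checked with a short sketch. It assumes Python's `utf-16-be` codec as a stand-in for a 16-bit Unicode buffer and sticks to characters inside the 16-bit range the article describes. `utf16_units` is a hypothetical helper written for this example, and the `normalize` call is Python's way of comparing equivalent forms, not something the article prescribes.

```python
import unicodedata

def utf16_units(text):
    """Return the 16-bit code units of text (big-endian, no byte-order mark)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# Fixed width: Latin, Greek, Han, and the null each occupy one 16-bit unit.
units = utf16_units("A\u03a9\u4e2d\0")

# Diacritics and base characters: 'e' plus a combining acute accent is
# equivalent to the precomposed compatibility character U+00E9.
decomposed = "e\u0301"
precomposed = "\u00e9"
same_text = unicodedata.normalize("NFC", decomposed) == precomposed

# Private use area: U+E000 is reserved for end-user characters.
category = unicodedata.category("\ue000")   # 'Co' means private use

print([hex(u) for u in units], same_text, category)
```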
On the flip side, there are certain aspects that are specifically not a part of the Unicode standard, including:
Rendering and display.
The Unicode range includes complex languages (Bengali, Devanagari, etc.); however, none of today's OSes can automatically render the entire range of Unicode characters. Remember, the Unicode standard is a means of character encoding, not a development library or technology.
Typographical issues.
The specific appearance of a font is an artistic issue, whereas Unicode itself provides only plain text. A glyph that "looks wrong" for a particular language can be remedied by changing the typeface instead of requiring a new character set. There are fonts that map a wide range of Unicode characters, but there's no single consistent "Unicode font" that looks perfect in all the world's languages because visual adjustments must be made for some languages. Unicode is a single encoding that may require multiple typefaces.
Sort order.
Most modern OSes and database platforms can sort or compare characters and strings. Unicode does not guarantee a particular sort order.
Character input.
Keyboard layouts and input methods are dependent on software, not on character encoding.
Locale-specific data.
Currency symbols and punctuation marks are not assigned to any particular locale in Unicode. The Unicode specification does not contain locale formatting information such as date and time conventions.
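The sort-order point in the list above is easy to demonstrate. In this sketch, `rough_key` is a hypothetical and deliberately crude stand-in for a real collation routine supplied by an OS or database; it only drops accents and ignores case, and the word list is invented for the example.

```python
import unicodedata

words = ["zebra", "Apple", "\u00e9clair"]

# Raw codepoint order: uppercase before lowercase, accented letters last.
by_codepoint = sorted(words)

def rough_key(word):
    """Hypothetical collation stand-in: drop accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

by_rough_key = sorted(words, key=rough_key)

print(by_codepoint)
print(by_rough_key)
```

Codepoint order puts the accented word after "zebra", which no dictionary would do. Real products should rely on the platform's locale-aware comparison rather than a helper like this one.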
Unicode Tools for Developers
Once you understand what Unicode does and does not provide, the next step is to find reasonable development shortcuts offered by OSes and tools. It is entirely possible to write a fully Unicode-enabled application without any system-level support or specific tools, but there are obvious advantages to using built-in support.
If the thought of using a fixed-width 16-bit encoding seems like a waste of file space or download time, remember that the alternatives can be much worse. A bit map of text takes far more space than any character encoding, and a bit map loses all meaning -- you can't search or sort on it. If you're using bit maps to store international text, you have effectively created a very expensive fax machine.
Although Unicode has rapidly gained popularity, several generations of OSes and tools were around long before Unicode existed (work on Unicode began in 1988, and the Unicode Consortium formed in 1991). Supporting Unicode is easiest when the operating system provides full support in every text I/O function. It is much more difficult to retrofit an operating system for Unicode than an application, and the most popular development tools don't mask the system's shortcomings in this area.
Use Where Needed
Fortunately, Unicode is not a monolithic yes/no issue for an entire product. So you might use Unicode in some areas where it's to your advantage and rely on individual character sets in other areas. Consider an incremental approach.
Here are some implementation priorities to think about when contemplating tools or operating systems.
First is the ability to convert existing data to and from Unicode. This is Unicode at its most basic level, and it's very easy to implement. Many of today's OSes provide APIs, utilities, or sample code for this. The advantage is that Unicode makes a great central conversion point since it's a superset of many common character sets.
An example is a client/server database, where the server stores data in Unicode and each client assumes a single character set. There are some cases where a 100 percent round-trip mapping to/from Unicode is not possible, mainly with Asian multibyte encodings. But the lost characters are generally those that are unique to a proprietary character set and cannot map to any other character set. If your target OS or tools fail to provide character-set conversion, you can build your own conversion using mapping tables available from the Unicode Consortium.
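The conversion-hub idea can be sketched as follows. This assumes Python's built-in codecs play the role of the OS conversion APIs; `to_unicode`, `from_unicode`, and the choice of ISO 8859-7 and Windows code page 1253 for the two clients are hypothetical details for illustration.

```python
def to_unicode(data, charset):
    """Client bytes in a legacy character set -> Unicode text."""
    return data.decode(charset)

def from_unicode(text, charset):
    """Unicode text -> bytes in the character set a given client expects."""
    return text.encode(charset)

greek = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"   # the Greek word for Greece

# Client A sends ISO 8859-7 bytes; the server stores 16-bit Unicode.
client_a_bytes = from_unicode(greek, "iso8859_7")
stored = to_unicode(client_a_bytes, "iso8859_7").encode("utf-16-be")

# Client B, on a different Greek code page, reads the same record.
client_b_bytes = from_unicode(stored.decode("utf-16-be"), "cp1253")
round_trip_ok = to_unicode(client_b_bytes, "cp1253") == greek

# A client whose character set lacks Greek cannot receive the text intact.
try:
    from_unicode(greek, "latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(len(client_a_bytes), len(stored), round_trip_ok, latin1_ok)
```

The server never needs to know a direct mapping between the two client character sets; each conversion goes through Unicode. The failed Latin-1 conversion shows the kind of case where a round trip is not possible.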
Also consider character display from within a document. "Document" is not just for word processing; it can refer to any application that handles data. Although Unicode does not provide a display rendering engine, using a consistent encoding helps in the development of products that can display documents in a large number of languages from the same binary. An OS that is not fully Unicode-enabled may support Unicode content display -- an important shortcut if you want to consolidate your international binaries. Text display is easier to implement than text input, and it is often a higher priority. (If you have any doubts about that, think of all those Web browsers out there.)
Next, remember that you can categorize character display into "simple" and "complex" languages. Although it contains many characters, Japanese is a "simple" language to display because its characters are static and do not change shape. Arabic, Hindi, and Tibetan are examples of complex languages because a character can dynamically change shape as you type other adjacent characters or diacritics. Today's OSes can display most simple languages but very few or none of the complex languages. If you require complex languages, you may need to use an OS add-in unless you want to write a huge amount of code from scratch.
Another consideration is character input into a document. If the target system accepts Unicode at the input side, you can avoid having to #ifdef for DBCS at every edit field. Cursor movement, text selection, copying, insertion, and pointer math are directly dependent upon how consistent or inconsistent your encoding is. As of today, few OSes automatically handle Unicode at the input side, so you may want to use third-party libraries or tools for Unicode-enabling from companies like Gamma Productions (for whom this author works), Star+Globe, and Zinc. Keep in mind that character encoding cannot solve certain internationalization requirements such as Asian input methods or switchable keyboard layouts.
User interface issues include menus, list boxes, and dialog boxes. Being able to directly send Unicode to a system UI is convenient, but it's not offered in most systems. Unless you are building a multilingual UI that can be switched at run time, you will be shipping separate binaries for the UI portion of your application, even if you have full Unicode document content I/O. UI localization methods are a separate topic from international content enabling, although they will feel the Unicode impact more in the future as tools continue to improve.
Developments such as Java indicate that we can someday expect dynamic support for the UI in any language, just as today it has become easier to dynamically support document content in a wide number of languages from within a single application.
Unicode Tips
Although every software development project for international use differs, here are some general suggestions you should follow as you tread the Unicode path:
First, read the book. The Unicode standard was recently updated. Not just for inducing sleep anymore, the v2.0 book is much larger and is more than just a bunch of tables. It includes technical descriptions of the world's major language scripts, material on composition of complex writing systems, sample code for UTF-7 and UTF-8, and other goodies that international software designers will find useful.
Always manipulate characters, not individual bytes. You do not, for example, want to accidentally grab half a Unicode character during pointer movement.
Change any code that assumes that characters are 8 bits long. Also, check for references to any table or index of size 256, which often signals an incorrect assumption that characters are only 1 byte long.
Remember: Even though it's a waste, a Unicode null = 16 bits.
Compilers are not always closely connected to the target operating system. Parameters for wide-character support may step into thin air, even if your compiler comes from the same company that built the OS -- best to check with the system's API reference before making assumptions.
The ANSI/ISO C standard includes the wide-character data type, wchar_t. In some cases you can use wchar_t* instead of char* to work with 16-bit characters instead of single bytes. But you have to be careful: On many Unix implementations, wchar_t is only an 8-bit value.
Lastly, use an incremental approach. Don't become discouraged by the lack of Unicode support in some operating systems; they often support at least some Unicode shortcuts, which you can use now and expand upon later. Or get some libraries and tools that help you work around these issues independently of the OS for now. They'll catch up to your farsightedness eventually.
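Several of the tips above (manipulate characters rather than bytes, remember the 16-bit null, distrust 256-entry tables) can be shown in one short sketch. It assumes a little-endian 16-bit buffer built with Python's `utf-16-le` codec; `unit_length` and `upper_table` are hypothetical names for this illustration.

```python
text = "A\u03a9"                               # 'A' and Greek capital omega
raw = text.encode("utf-16-le") + b"\x00\x00"   # 16-bit units plus a 16-bit null

# Byte-oriented habit: stop at the first zero byte. That is the high
# byte of 'A', so the "string" appears to be one byte long.
byte_len = raw.index(b"\x00")

def unit_length(buf):
    """Count 16-bit units before the 16-bit null terminator."""
    n = 0
    while 2 * n + 2 <= len(buf) and buf[2 * n:2 * n + 2] != b"\x00\x00":
        n += 1
    return n

char_len = unit_length(raw)

# A 256-entry lookup table silently assumes 1-byte characters.
upper_table = [chr(c).isupper() for c in range(256)]
try:
    upper_table[ord("\u03a9")]
    table_ok = True
except IndexError:
    table_ok = False

print(byte_len, char_len, table_ok)
```

The byte scan reports a length of 1 while the unit scan reports 2 characters, and the 256-entry table has no slot for the Greek letter. Both are the kinds of 8-bit assumptions the tips warn about.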
Where to Find
Accent Software International
Jerusalem, Israel
Phone: +972 2-6793-723
Internet: http://www.accentsoft.com