Developing Windows software for the global market could be your greatest challenge. Here's how to start.
Dean Abramson
Take this pop quiz: Jane is a project manager at Really Big Airline Corp. Her group is developing the next version of its leading flight reservation software. There are x2 localized versions of Windows 3.1, y-1 localized versions of Windows NT, and a single, localized beta version of Chicago. How many executables should Jane's group develop if Really Big Airline Corp. plans to support facilities in 62 countries for customers and employees who work in English, German, Japanese, Russian, Greek, Hebrew, Arabic, Hindi, and Inuktitut? Who in the world speaks Inuktitut?
Your time's up. The correct answers are 1 and the Inuit of the Canadian Arctic. If you received a perfect score, you may skip the remainder of this article and submit your resume for Jane's job. For the rest of you, get out your highlighters and let's get started.
Globalization 101
If you are a software developer, you are likely to be involved in some facet of software localization. If you're lucky, this involves only some quick translations in your resource module, replacing the menus, dialog boxes, and stringtable. If you didn't plan ahead, you might have to build your resource module. In either case, your software is soon ready for release in 28 Latin-based languages. You tweak the code a bit, add some fonts, and presto, your software supports Russian and Greek--mostly. If you find you need a Japanese user interface, simply install the Japanese Windows SDK (Software Development Kit) from your Microsoft Developer Network CD-ROM. Then pray no one asks for Arabic.
Fortunately, there is a better way to design software for the global market. Globalization is an approach to designing software that can support the processing of data in all languages simultaneously. Each of the various Windows platforms supports software globalization to some extent; however, many of these features are restricted to specific local versions.
One for All, or All for One
Language properties get extraordinarily complex once you extend beyond the Latin script. To sort them all out requires a team of experts. The best design approach to globalizing software is to maintain strict language independence throughout your code. No portion of your software should contain code that is specific to any one language. Also, your code should rely entirely on the language services offered by the operating system or another qualified language API.
If you deviate from the language-independent model, things can quickly get out of control. The task of managing the multitude of language properties on your own can be overwhelming. Think back to your days of applications development for DOS. How many printers did you have to borrow to write all those printer drivers? Since the advent of the Windows GDI (Graphical Device Interface), you'd be crazy to use anything but the available GDI functions if you expect your application to work with any printer.
Each Windows platform offers a variety of language support functions. Unfortunately, none of the platforms are adequate for supporting true software globalization. The services they provide are generally limited to those needed for processing only in the specific local language. This means that Arabic Windows provides you with the means for processing Arabic but not Hebrew, even though the concepts and algorithms for Arabic and Hebrew are nearly identical. Far East versions of Windows provide yet another paradigm for text processing in Chinese, Japanese, and Korean, each with idiosyncrasies of its own. Consequently, localizing your product to each Windows platform can become time-consuming and expensive yet yield only a minimal solution.
Microsoft has recognized these limitations and plans to address these issues in upcoming releases of Chicago and Cairo. This shift toward language independence is already evidenced in Windows NT.
Unicode to the Rescue
When it comes to how you store and process the characters you must use for a given language, there's little to debate. Unicode is the worldwide character-encoding standard destined to replace ASCII and the multitude of other single- and multibyte character sets currently in existence. It was constructed by a consortium whose membership list reads like a who's who of computer companies around the world. Unicode is a fixed-width, 16-bit character set, which means it can represent more than 65,000 characters. The standard encompasses scripts and general-purpose symbols for writing text in nearly every language in modern use, as well as many ancient languages.
Characters are organized by scripts so that every character with a unique semantic is assigned a unique character value. This means that a B in English is the same as a B in French, because both English and French are written with the same script: Latin. However, they are not the same as a B in Russian, which shares the same glyph (i.e., character shape) as the Latin B but belongs to the Cyrillic script. Since you must be able to keep track of such differences in characters from dozens, if not hundreds, of languages at once, Unicode will serve as a solid foundation for software globalization.
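To make the B example concrete, here is a minimal sketch in portable C. It assumes the 16-bit WCHAR type discussed later in this article; the enum and the function script_of() are hypothetical names, and the ranges cover only the basic Latin letters and the Greek and Cyrillic blocks.

```c
typedef unsigned short WCHAR;   /* 16-bit Unicode value */

enum Script { SCRIPT_LATIN, SCRIPT_GREEK, SCRIPT_CYRILLIC, SCRIPT_OTHER };

/* Classify a character by the Unicode range its value falls in.
   Only a few ranges are covered; this is an illustration, not a full table. */
enum Script script_of(WCHAR ch)
{
    if ((ch >= 0x0041 && ch <= 0x005A) || (ch >= 0x0061 && ch <= 0x007A))
        return SCRIPT_LATIN;            /* A-Z, a-z */
    if (ch >= 0x0370 && ch <= 0x03FF)
        return SCRIPT_GREEK;
    if (ch >= 0x0400 && ch <= 0x04FF)
        return SCRIPT_CYRILLIC;
    return SCRIPT_OTHER;
}
```

The Latin B is U+0042 and the look-alike Cyrillic letter is U+0412, so the two never collide even though a font may draw them with the same shape.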
Although there is a minor penalty for storing uniform 2-byte (16-bit) values for each character, Unicode has significant advantages. It eliminates the confusion of overlapping, single-byte (8-bit) code pages in which a character's identity is dependent on the active code page. Also, the uniform 16-bit width of each character makes it easy to determine character boundaries in contrast to multibyte character encodings that, by definition, contain characters of either 1 or 2 bytes in size.
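The boundary problem can be sketched in a few lines of portable C. The example assumes Shift-JIS as the multibyte encoding (its lead-byte ranges are 0x81-0x9F and 0xE0-0xFC); the function names are hypothetical.

```c
#include <stddef.h>

typedef unsigned short WCHAR;   /* 16-bit Unicode value */

/* In Shift-JIS, a byte in 0x81-0x9F or 0xE0-0xFC starts a 2-byte character. */
static int is_sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters in a NUL-terminated multibyte string.
   Every byte must be inspected to learn where each character ends. */
size_t dbcs_char_count(const unsigned char *s)
{
    size_t n = 0;
    while (*s) {
        s += (is_sjis_lead_byte(*s) && s[1] != 0) ? 2 : 1;
        n++;
    }
    return n;
}

/* Count characters in a NUL-terminated 16-bit string.
   Every array element is exactly one character, so element i is always
   a character boundary and no lead-byte test is needed. */
size_t wide_char_count(const WCHAR *s)
{
    size_t n = 0;
    while (s[n])
        n++;
    return n;
}
```

With the multibyte form, you cannot step backward or jump to the nth character without rescanning from a known boundary; with the 16-bit form, simple array indexing is enough.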
More important, you can maintain the identity of more than 256 characters at once and process characters by their intrinsic meaning, independent of the code points that represent the character in a font. Last, but not necessarily least, Microsoft has recognized Unicode's role as the new character-encoding standard by fusing it into the fundamental architecture of Windows NT. Windows 3.1, Win32s, and (regrettably) Chicago are all sans the Unicode support provided in NT, though this is not entirely a problem, as discussed later. Note that Chicago has some sprouts of Unicode growth as evidenced by the fact that the new, long filenames are actually stored in Unicode on disk.
What Goes Out Must First Get In
How often have you paused to consider the connection between pressing a letter key on your keyboard and seeing that letter appear in your document? A typical user will modify the keyboard driver from the Control Panel when Windows is installed, if it is not already set by default, and never give it a second thought.
Assuming you have a Latin-based keyboard, how can you type characters for Punjabi or Khmer? You need a mechanism for switching the keyboard on the fly as you switch languages. This type of mechanism exists in Windows in three varieties.
The SDK for Windows 3.1 (and Win32s, which relies on the services of Windows 3.1 in this case) does not offer a function call to switch between keyboards. You need to make changes manually, independent of the application, through the International section in the Control Panel.
Although making manual changes isn't necessary for switching among languages such as English, German, French, and Spanish, it is a must for typing Greek or Russian. This procedure is cumbersome when you constantly need to switch between layouts in an application or when you use only a single layout in one application and need to return to the system keyboard layout for your other applications.
Windows NT and Chicago offer the flexibility of changing the active keyboard layout through the API function calls LoadKeyboardLayout() and ActivateKeyboardLayout(). Any changes made by an application, however, will be made to the entire system, thereby affecting your other applications.
Additionally, you may have difficulty remembering that the Dvorak keyboard is loaded under the alias "00010409" (in string, not integer, format). This is because keyboards are indexed by their language identifier, discussed later. Also, the gamut of keyboards currently supported by any Windows platform is quite limited for true multilingual computing.
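The string form of a layout name is easy to build once you see its shape. The sketch below is portable C and does not call the Windows API; make_layout_name() is a hypothetical helper, and the split into a high "variant" word and a low language-identifier word is an assumption read off the Dvorak alias quoted above (0x0409 being the language identifier for U.S. English).

```c
#include <stdio.h>

/* Build the 8-hex-digit layout name used to identify a keyboard layout.
   Assumption: high word = layout variant, low word = language identifier.
   'out' must have room for 8 digits plus the terminating NUL. */
void make_layout_name(unsigned variant, unsigned lang_id, char out[9])
{
    sprintf(out, "%04X%04X", variant & 0xFFFFu, lang_id & 0xFFFFu);
}
```

Calling make_layout_name(0x0001, 0x0409, name) yields the Dvorak alias, which is the kind of string you would then hand to LoadKeyboardLayout().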
The third type of keyboard switching is available only in Arabic and Hebrew versions of Windows. Aware applications can switch between one local keyboard and one Latin keyboard by choosing a button on a window title bar. This exemplifies how keyboard switching becomes a priority once you begin to type text in differing scripts (such as Hebrew and English). The implementation must be extended in globalized software, however, to allow for a wider variety of keyboards that can support any script.
The standard method for entering text in Far East versions of Windows is through the IME (Input Method Editor) standard, which allows you to type the thousands of CJK (Chinese, Japanese, and Korean) characters from Latin character combinations. This, of course, saves you the hassle of learning to touch-type on a 10,000-key keyboard. The principles involved in using an IME to type any of the CJK languages can also be useful for typing in any other language. In essence, text of any type can be piped through an input method to produce text of any other type.
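The idea of piping text through an input method can be sketched as a table-driven, longest-match conversion. This is not the Windows IME interface; the table, the struct, and ime_convert() are hypothetical, and the table holds only five Latin-to-hiragana entries for illustration.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned short WCHAR;   /* 16-bit Unicode value */

struct ImeEntry { const char *latin; WCHAR kana; };

/* A tiny illustrative table: Latin keystrokes to hiragana characters. */
static const struct ImeEntry ime_table[] = {
    { "ka", 0x304B }, { "na", 0x306A }, { "ni", 0x306B },
    { "a",  0x3042 }, { "i",  0x3044 }
};

/* Convert typed Latin text to output text, preferring the longest table
   match at each position. Unmatched input passes through unchanged.
   Returns the number of characters written (excluding the NUL). */
size_t ime_convert(const char *in, WCHAR *out, size_t out_cap)
{
    size_t n = 0;
    while (*in && n + 1 < out_cap) {
        size_t best = 0, i;
        WCHAR best_ch = 0;
        for (i = 0; i < sizeof ime_table / sizeof ime_table[0]; i++) {
            size_t len = strlen(ime_table[i].latin);
            if (len > best && strncmp(in, ime_table[i].latin, len) == 0) {
                best = len;
                best_ch = ime_table[i].kana;
            }
        }
        if (best == 0) {
            out[n++] = (WCHAR)(unsigned char)*in;   /* pass through */
            in++;
        } else {
            out[n++] = best_ch;
            in += best;
        }
    }
    if (out_cap)
        out[n] = 0;
    return n;
}
```

Swap in a different table and the same loop could serve another script, which is the sense in which an input method is language-independent plumbing.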
Detour Ahead
Once characters exist in your document, you need to learn how to process them in a language-independent fashion. If you can envision the crux of multilingual text processing as a journey that begins in Los Angeles and ends in New York City, be prepared to make a pit stop somewhere near Budapest. Everything you already know about text processing is old baggage and will get you lost in a hurry. Throw it all away. You'll have to pack some clean underwear for this trip.
Examine this segment of code, which should look familiar:
char szBuffer[] = "Hello World!";
ExtTextOut(hDC, x, y, 0, NULL, szBuffer, 12, NULL);
There are a thousand and one reasons why this code will now pose a problem in the multilingual world. For starters, because you are using Unicode, you must allocate 2 bytes per character instead of 1 byte. This can be rectified as follows:
wchar_t wzBuffer[] = L"Hello World!";
where wchar_t is the data type for a wide character, and the L prefix preceding the "Hello World!" string specifies a string of wide characters. Since the size of wchar_t is actually compiler-dependent (in Win32, it is defined to be 16 bits) and refers to any generic wide character, try the additional modification:
typedef wchar_t WCHAR;
WCHAR wzBuffer[] = L"Hello World!";
This emphasizes the fact that the data type you're storing is Unicode because WCHAR is the preferred Unicode character type in Win32, and it provides some protection against compiler variations should you encounter problems porting your code to another platform. Also, since Windows 3.1 does not support the wchar_t data type, you should use:
typedef unsigned short WCHAR;
This happens to be the same way wchar_t is defined in the Win32 SDK.
That leaves about 999 problems to deal with, most of which involve assumptions about the content of the text that do not account for variances in processing and rendering that don't exist in the world of Latin text. For example, it would be incorrect to assume that the Unicode values you store are the same as the values you use for rendering. One process that affects this is contextual analysis, which is the act of assigning the cursive shapes of characters depending on their relative position to surrounding characters. This is most evident in a language such as Arabic where it is a requirement that each character connect appropriately to its adjoining characters, similar to cursive handwriting.
As a result, each character is capable of rendering in one of four forms. There is a side benefit from all this extra work, though. By accounting for this type of behavior for Arabic, you have a nifty way of implementing beautiful calligraphic handwriting in English, provided you follow a language-independent model.
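Here is a minimal sketch of contextual analysis in portable C. It assumes text stored in logical (typing) order and a deliberately tiny joining table: a handful of letters that connect on both sides, and a few (such as Alef and Reh) that connect only to the letter before them. The names and the table are hypothetical and far from complete.

```c
#include <stddef.h>

typedef unsigned short WCHAR;   /* 16-bit Unicode value */

enum Form { FORM_ISOLATED, FORM_INITIAL, FORM_MEDIAL, FORM_FINAL };
enum Join { JOIN_NONE, JOIN_RIGHT, JOIN_DUAL };

/* Illustrative subset only: everything not listed is treated as non-joining. */
static enum Join join_class(WCHAR ch)
{
    switch (ch) {
    case 0x0627: case 0x062F: case 0x0630:      /* Alef, Dal, Thal */
    case 0x0631: case 0x0632: case 0x0648:      /* Reh, Zain, Waw  */
        return JOIN_RIGHT;      /* connects only to the preceding letter */
    case 0x0628: case 0x062A: case 0x0633:      /* Beh, Teh, Seen  */
    case 0x0644: case 0x0645: case 0x0646:      /* Lam, Meem, Noon */
    case 0x0647: case 0x064A:                   /* Heh, Yeh        */
        return JOIN_DUAL;       /* connects on both sides */
    default:
        return JOIN_NONE;
    }
}

/* Pick one of the four forms for text[i] from its neighbors. */
enum Form contextual_form(const WCHAR *text, size_t len, size_t i)
{
    enum Join self = join_class(text[i]);
    int joins_prev = i > 0 && self != JOIN_NONE &&
                     join_class(text[i - 1]) == JOIN_DUAL;
    int joins_next = i + 1 < len && self == JOIN_DUAL &&
                     join_class(text[i + 1]) != JOIN_NONE;

    if (joins_prev && joins_next) return FORM_MEDIAL;
    if (joins_prev)               return FORM_FINAL;
    if (joins_next)               return FORM_INITIAL;
    return FORM_ISOLATED;
}
```

For the four letters Seen, Lam, Alef, Meem, the sketch selects the initial, medial, final, and isolated forms in turn; the Meem stands alone because Alef never connects to the letter that follows it.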
Furthermore, combinational analysis is used to combine clusters of characters that need to be rendered as a single unit, or to generate separate characters for those that may be rendered only by their components. For example, Arabic requires that the character sequence Lam followed by Alef combine to form the Lam-Alef ligature. This single glyph is now used to represent both characters, which means that your original string has one less character to render.
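The Lam-Alef case can be sketched as a single pass over the string. This portable C example assumes the rendering code for the ligature is U+FEFB, the isolated Lam-Alef form in Unicode's Arabic presentation forms; combine_lam_alef() is a hypothetical name, and the sketch ignores both the joined (final) form of the ligature and the Alef variants that carry marks.

```c
#include <stddef.h>

typedef unsigned short WCHAR;   /* 16-bit Unicode value */

#define CH_LAM              0x0644
#define CH_ALEF             0x0627
#define CH_LAM_ALEF_ISOLATED 0xFEFB  /* presentation form, isolated shape */

/* Copy 'in' to 'out', replacing each Lam + Alef pair with one ligature code.
   'out' must hold at least 'len' elements. Returns the new length. */
size_t combine_lam_alef(const WCHAR *in, size_t len, WCHAR *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        if (in[i] == CH_LAM && i + 1 < len && in[i + 1] == CH_ALEF) {
            out[n++] = CH_LAM_ALEF_ISOLATED;
            i += 2;                 /* two stored characters consumed */
        } else {
            out[n++] = in[i++];
        }
    }
    return n;
}
```

Note that the output is a rendering string, not something to store back into the document: the stored text keeps both characters so that searching and sorting still see a Lam and an Alef.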
The biggest obstacle is the assumption that all text is displayed from left to right. Multilingual text is frequently bidirectional, which means it may be displayed as a mix of left-to-right and right-to-left text. You'll have to decide from the start of your development whether or not bidirectional text support is a priority. Because it is probably the single most complex issue to implement, it may not be worth the potentially overwhelming effort for some developers. Consider carefully, because not supporting bidirectional text eliminates potential customers who use languages of the Arabic and Hebrew scripts as well as those who prefer to process Chinese in the customary horizontal right-to-left direction. Microsoft provides a complement of bidirectional functions to help arrange text of mixed directions in the appropriate visual order but only in Arabic and Hebrew Windows SDKs.
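To give a feel for the visual-ordering step, here is a deliberately naive sketch in portable C. It is not the Unicode bidirectional algorithm and not the Microsoft functions mentioned above: it assumes a left-to-right paragraph, treats every value in the Hebrew and Arabic ranges (U+0590 to U+06FF) as right-to-left, and ignores digits, spaces, and punctuation, which a real implementation must resolve. The names are hypothetical.

```c
#include <stddef.h>

typedef unsigned short WCHAR;   /* 16-bit Unicode value */

/* Crude direction test: the Hebrew and Arabic ranges. */
static int is_rtl(WCHAR ch)
{
    return ch >= 0x0590 && ch <= 0x06FF;
}

/* Copy logical-order text to 'out' in visual order by reversing each
   maximal run of right-to-left characters. 'out' must hold 'len' elements. */
void visual_order(const WCHAR *in, size_t len, WCHAR *out)
{
    size_t i = 0;
    while (i < len) {
        size_t j = i, k;
        if (!is_rtl(in[i])) {
            out[i] = in[i];
            i++;
            continue;
        }
        while (j < len && is_rtl(in[j]))
            j++;                            /* [i, j) is one RTL run */
        for (k = i; k < j; k++)
            out[k] = in[i + (j - 1 - k)];   /* mirror the run in place */
        i = j;
    }
}
```

Even this toy version shows why the stored order and the displayed order must be kept apart: cursor movement, selection, and hit testing all have to map between the two.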
Windows NT is the only platform that provides some built-in support for transforming text through its Unicode and language APIs. As mentioned earlier, you can access NT language-support API functions through a language identifier. This provides a mechanism, called NLS (National Language Support), for specifying the appropriate rules that should be applied to text of a given language or locale. For example, the function CompareStringW() can be used to compare one string to another by specifying the language of the strings. This is necessary because the expected sorting order of identical pairs of strings varies from one language to another. Better yet, Microsoft offers a complete set of NLS resources across local versions of Windows NT.
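The following portable C sketch shows why a comparison needs to know the language. It does not call CompareStringW(); it is a stand-in with hypothetical names and a two-language table built on one well-known difference: German dictionary order files ä with a and ö with o, while Swedish places ä and ö after z.

```c
typedef unsigned short WCHAR;   /* 16-bit Unicode value */

enum Lang { LANG_GERMAN, LANG_SWEDISH };    /* hypothetical identifiers */

/* Sort key for one lowercase character under the given language's rules.
   Base letters a..z get keys 10, 20, ..., 260, leaving gaps for variants. */
static unsigned sort_key(enum Lang lang, WCHAR ch)
{
    if (ch >= 'a' && ch <= 'z')
        return (unsigned)(ch - 'a' + 1) * 10;
    switch (ch) {
    case 0x00E4:    /* a with diaeresis */
        return lang == LANG_SWEDISH ? 280 : 11;
    case 0x00F6:    /* o with diaeresis */
        return lang == LANG_SWEDISH ? 290 : 151;
    default:
        return 1000u + ch;      /* everything else sorts last, by value */
    }
}

/* Compare two NUL-terminated strings; returns -1, 0, or 1. */
int compare_strings(enum Lang lang, const WCHAR *a, const WCHAR *b)
{
    while (*a && *b) {
        unsigned ka = sort_key(lang, *a), kb = sort_key(lang, *b);
        if (ka != kb)
            return ka < kb ? -1 : 1;
        a++;
        b++;
    }
    if (*a) return 1;
    if (*b) return -1;
    return 0;
}
```

The same pair of strings, ä and z, compares as less-than under the German rules and greater-than under the Swedish ones, so a sort routine with no language parameter is wrong for at least one of the two audiences.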
Unfortunately, Win32s, which is used to run 32-bit NT applications on Windows 3.1, leaves out these Unicode and language APIs. But before you convince yourself to make the jump to NT in hopes that all will be solved, be aware that the technology for implementing the variety of character transformations for a multitude of complex languages does not yet exist and will not likely be included before the turn of the century. Until then, applications will have to continue to rely on the specific language services provided by each local version of Windows for enhanced local-language support.
Seeing Is Believing
All this is meaningless if you can't actually see the text on the screen and the printed page. Once the process of transforming Unicode character values into renderable characters is complete, you need to locate the appropriate glyphs in the appropriate fonts and find a way to place those characters at the appropriate location on the device. This is not as easy as it used to be. Tens of thousands of characters are needed to support all the languages of the world, but only 224 characters are available in a standard single-byte font.
One apparent solution is to simply create a Unicode font: a double-byte font that contains a glyph for each Unicode character. While this solution may seem ideal at first, it is probably the most unlikely prospect to provide a good working solution. The sheer quantity of characters makes these fonts incredibly large (on the order of 4 MB or more) and expensive to develop.
The size is compounded by the fact that many characters are rendered by more than a single glyph, depending on numerous factors. As discussed earlier, four shapes for each character must be present to render Arabic text. Also, in the absence of some extraordinarily complex rendering algorithms, it is necessary to maintain dual or even multiple representations of nonspacing diacritics at various locations in the character frame to correctly position the diacritic on base characters. In some instances, this can be accomplished only by rendering precomposed glyphs not defined by Unicode. The most compelling obstacle to Unicode fonts is that Windows 3.1 and Win32s do not provide any means of processing or rendering double-byte fonts.
The most promising solution for rendering fonts lies in font mapping, which provides a mechanism for converting Unicode characters into independent glyph codes in a font. Here, you can use all the existing fonts and font technology built into the operating system by grouping subsets of characters within standard single-byte fonts. This is effective because text from any language (with the exception of the CJK languages) can almost always be represented effectively by 224 characters, including numbers and punctuation. Therefore, the font-mapping layer must understand the relationship between the Unicode values and the available glyphs within a given font.
While this introduces an extra layer of complexity, it also provides extra flexibility that could allow you to take advantage of a more complex font, such as an English handwriting font capable of providing contextual handwriting, as in Arabic. Besides, a Unicode font would surely require a similar mapping process to choose the appropriate glyphs from the font.
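A font-mapping layer can start out as nothing more than a range table. The portable C sketch below is hypothetical throughout: the font numbers, the glyph layouts, and the names are invented for illustration, and a real layer would also have to account for the contextual forms and ligatures described earlier.

```c
#include <stddef.h>

typedef unsigned short WCHAR;   /* 16-bit Unicode value */

/* Where a character's glyph lives: which single-byte font, which slot. */
struct GlyphRef { int font; unsigned char glyph; };    /* font < 0: unmapped */

struct MapRange { WCHAR first, last; int font; unsigned char base; };

/* Hypothetical layouts: each font keeps its letters from slot 0x41 upward,
   except font 0, whose glyph codes equal the character values. */
static const struct MapRange map_ranges[] = {
    { 0x0020, 0x00FF, 0, 0x20 },    /* Latin                      */
    { 0x0391, 0x03C9, 1, 0x41 },    /* Greek Alpha..omega         */
    { 0x0410, 0x044F, 2, 0x41 },    /* Cyrillic A..ya             */
    { 0x05D0, 0x05EA, 3, 0x41 }     /* Hebrew Alef..Tav           */
};

/* Translate one Unicode value into a (font, glyph code) pair. */
struct GlyphRef map_glyph(WCHAR ch)
{
    struct GlyphRef ref = { -1, 0 };
    size_t i;
    for (i = 0; i < sizeof map_ranges / sizeof map_ranges[0]; i++) {
        if (ch >= map_ranges[i].first && ch <= map_ranges[i].last) {
            ref.font = map_ranges[i].font;
            ref.glyph = (unsigned char)
                (map_ranges[i].base + (ch - map_ranges[i].first));
            break;
        }
    }
    return ref;
}
```

A renderer built on this would walk the string, switch fonts whenever the font field changes, and pass runs of glyph codes to the ordinary single-byte text output calls.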
In the cases of languages with very large character sets, you can use either multiple single-byte fonts (adding somewhat to the complexity of the font-mapping layer) or a double-byte font. Such a double-byte font does not have to contain all Unicode characters, just the subset of characters you want to use with that font. Although Windows NT and Chicago are the only Windows platforms that provide the built-in functionality for handling double-byte fonts, it is possible for software developers to add this functionality to the other Windows systems, though the technology is not inexpensive to develop or license.
To #define or Not to #define
A final note about compiling your code is that the Win32 SDK for Windows NT offers the UNICODE compile-time flag, which you set with #define UNICODE. But don't expect to include this definition at the top of your source module and automatically enjoy mixed Sinhalese and Klingon editing. You'll probably have more luck using #define NOBUGS to eliminate the possibility of logic errors creeping into your source. For most purposes, this flag merely decides the manner in which your application exchanges information with the system and does not effectively alter the range of services provided.
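The effect of such a flag can be imitated in a few lines of portable C. The sketch below does not include the Win32 headers; DEMO_TCHAR, demo_draw, and the two stand-in routines are hypothetical names that mimic the pattern of one source line resolving to either a single-byte or a 16-bit interface.

```c
#include <stddef.h>

typedef unsigned short WCHAR;   /* 16-bit Unicode value */

/* Two stand-ins for a system routine: one takes single-byte text,
   the other 16-bit text. Each just reports its character width. */
size_t demo_draw_a(const char *text)  { (void)text; return sizeof(char); }
size_t demo_draw_w(const WCHAR *text) { (void)text; return sizeof(WCHAR); }

/* The flag picks the character type and the routine at compile time.
   Build with UNICODE defined to select the 16-bit pair. */
#ifdef UNICODE
typedef WCHAR DEMO_TCHAR;
#define demo_draw demo_draw_w
#else
typedef char DEMO_TCHAR;
#define demo_draw demo_draw_a
#endif

/* Size of the character type the rest of the program will use. */
size_t demo_char_size(void)
{
    return sizeof(DEMO_TCHAR);
}
```

Nothing in the sketch adds a language service; the flag only changes which form of text crosses the boundary, which is the point made above.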
Windows NT is, at heart, a Unicode-based operating system. That means it keeps track of character information in Unicode rather than in the system ANSI code page as in Windows 3.1 and Chicago. This character information must be converted to the single-byte ANSI code page for applications that expect characters in this form. Additionally, in 32-bit executables, resource data is inherently stored as strings of Unicode characters; the ASCII text you used to write the source module is converted at compile time. You can use #define UNICODE to help retrieve resource information in its native Unicode format, which is useful if you are actually storing multilingual strings in your resource file. It can also reduce the overhead involved in translating strings between the application and the operating system.
But beware. One of the most appealing aspects of developing applications on Windows NT is that you can enjoy the benefits of a 32-bit application running on Windows 3.1 with Win32s. But the #define UNICODE option is incompatible with Win32s, because Windows 3.1 does not contain the same underlying Unicode support built into Windows NT. So if your target platform is Windows 3.1, your best bet is still to develop on NT, leaving out #define UNICODE. Even if your target environment is Windows NT, the #define UNICODE option is not necessary for multilingual computing.
Class Dismissed
What good is all this going to do if Windows does not yet provide the necessary services for software globalization? The answer is that you must first understand the principles of globalization before you can expect to support a model of language independence.
Should you feel compelled to begin implementing some of these concepts right now, you'll want to obtain the entire Unicode 1.1 standard. It consists of The Unicode Standard Worldwide Character Encoding Version 1.0, Volumes 1 and 2, and Unicode Technical Report #4. The books are published by Addison-Wesley, and you can purchase them at your local bookstore or order them from the Unicode Consortium by calling (408) 777-5870 or sending E-mail to unicode-inc@unicode.org via the Internet. The books describe a wide variety of languages and provide algorithms for many of the issues that have been discussed here. A consolidated single-volume edition for Unicode 1.1 should be available in early 1995.
Once you become familiar with the underlying concepts, you can begin to incorporate them into your software immediately by organizing your software so that all the processes related to languages are completely encapsulated in DLLs that supply the necessary language-independent functions. By doing so, you can eventually replace your own functions with services from the system, once they satisfy all your needs. This should give you a good head start by the time the functionality becomes available.
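One way to arrange that encapsulation is a table of function pointers that the rest of the application calls through. The portable C sketch below is only an outline under stated assumptions: the structure, the two services it offers, and every name in it are hypothetical, and in a real product these entry points would be exported from a DLL.

```c
typedef unsigned short WCHAR;   /* 16-bit Unicode value */

/* The language services the application is allowed to call. */
struct LangServices {
    WCHAR (*to_upper)(WCHAR ch);
    int   (*is_rtl)(WCHAR ch);
};

/* Home-grown implementations, used until the system offers better ones. */
static WCHAR own_to_upper(WCHAR ch)
{
    return (ch >= 'a' && ch <= 'z') ? (WCHAR)(ch - 'a' + 'A') : ch;
}

static int own_is_rtl(WCHAR ch)
{
    return ch >= 0x0590 && ch <= 0x06FF;    /* Hebrew and Arabic ranges */
}

static struct LangServices services = { own_to_upper, own_is_rtl };

/* Swap in another provider, for example wrappers around system calls. */
void lang_install(const struct LangServices *replacement)
{
    services = *replacement;
}

/* The only entry points the application code sees. */
WCHAR lang_to_upper(WCHAR ch) { return services.to_upper(ch); }
int   lang_is_rtl(WCHAR ch)   { return services.is_rtl(ch); }
```

Because callers never touch own_to_upper() or own_is_rtl() directly, replacing them later is a one-line call to lang_install() rather than a search through the whole source tree.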
Table: WINDOWS LANGUAGE SUPPORT (This table is not available electronically. Please see November, 1994, issue.)
Illustration: Example of the Arabic Lam-Alef character (top). Below is the Lam-Alef used in the Arabic word for peace.
Dean Abramson is the Unicode and language-technology architect at Gamma Productions, Inc. (Los Angeles, CA). Gamma recently released the beta version of its ILI (International Language Interface), which provides complete language support for Win32s, Windows NT, and Chicago. You can reach the author via the Internet at editors@bix.com.