guage. Microsoft places its translation and localization costs at as much as $300,000 or more, depending on complexity.
The process is also time-consuming. Companies spend an average of six months to a year preparing a localized-language version of software. This includes bringing in localization experts to check for technical, cultural, and linguistic accuracy.
Money Makes Money
If it is such a hassle, why bother? Companies do it because selling localized-language versions is big business. The Software Publishers Association (Washington, DC) estimates that of the more than $28.7 billion in packaged business software sold in 1995, 60 percent of those sales were outside the U.S.
For many software companies, a global product strategy is akin to planning a building in which you will house your company five years from now. While you might not need the opportunity today, will you tomorrow? As Orlando Ayala, Microsoft's vice president in charge of intercontinental sales and marketing, notes: "There is a very direct relationship between growth and the ability to produce localized versions." Over 50 percent of the company's more than $8 billion (fiscal 1996) revenues came from outside the U.S.
Although tempted by the opportunity to make money offshore, many independent software developers opt to focus solely on creating an English-language version of their software. It is only when that software is selling strongly in the U.S. market that they start to think about localizing their product for other markets. While such an approach may save money in the near term, it can prove costly over the long haul. An English-language software application not written with a foreign-language port in mind means that the developer will have to not only localize the software but also rewrite a significant portion of its code. Meanwhile, your better-prepared competition beats you to market.
Preparing for World Markets
Thanks to key developments in OS software, in the evolution of Unicode (see the article "Unicode Evolves"), and in other areas, independent developers now have more opportunities to prepare their products for localization long before they even think of stepping out of the country or their home state.
While support for Unicode in OSes and other developments make it easier for an independent software company to plan for future localization, they by no means ensure success. It is critical that developers lay some groundwork of their own at the code level to ensure that their product is fully Unicode-compliant and has enough extensibility to sell overseas. Here's what the specialists at Berlitz Translation Services say developers need to be aware of when writing English-language versions of their product:
1. Be sure your application is Unicode-enabled, meaning that it can accommodate single-byte (e.g., Arabic, Hebrew, and Cyrillic), double-byte (e.g., Chinese and Japanese), and Roman character sets.
2. Understand exactly which Unicode features the OS you intend to write for supports. Different versions of the OS may have done different things with Unicode.
3. Do not assume that your printer and display drivers can accommodate Unicode in final output. For instance, some character sets, when displayed, read from right to left or from the bottom of the screen to the top.
4. Avoid hard-coding certain standard items such as input fields, fixed character width, dialog boxes, and help definitions. Your application will need flexibility. Foreign languages tend to take 20 percent more space than English.
5. Avoid string concatenation. Separate strings of precoded phrases don't come together the same way in one language that they do in others. For example, stringing together the local string for "file," the local string for "error," and the local string for "has occurred" may not give the local string for "file error has occurred."
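One common way to follow item 5 is to store each message as a complete sentence per language and substitute values into named placeholders. The following is a minimal sketch, not a prescribed method: the MESSAGES catalog, the translate helper, and the sample German wording are all hypothetical and exist only for illustration.

```python
# Hypothetical message catalog: one complete sentence per language,
# with a named placeholder instead of concatenated fragments.
MESSAGES = {
    "en": {"file_error": "A file error has occurred in {name}."},
    "de": {"file_error": "In {name} ist ein Dateifehler aufgetreten."},
}


def translate(lang: str, key: str, **params: str) -> str:
    """Look up a whole-sentence template and fill in its placeholders."""
    catalog = MESSAGES.get(lang, MESSAGES["en"])      # unknown language: use English
    template = catalog.get(key, MESSAGES["en"][key])  # missing key: use English
    return template.format(**params)


print(translate("en", "file_error", name="report.txt"))
print(translate("de", "file_error", name="report.txt"))
```

Because the translator sees the whole sentence, the word order can differ freely between languages; the code only supplies the file name.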
The key point in following these steps is to make sure that, as a developer, you are aware of what you need, API-wise, to go global long before you do. Developers who don't follow these steps will have a difficult time localizing their code for different parts of the world. There are too many idiosyncrasies from one country to the next to be ready to localize a product on an ad hoc basis.
Your globalization chores will become easier if you focus on the areas mentioned in the following sections.
--H.H.P.
Get Help from the OS
Developers should make their application as culturally and human-language independent as possible. The OS can help them. Rather than writing quick-and-dirty language-dependent code first and then rewriting it later, programmers should use the routines provided as part of the OS for tasks such as text input and output. Apple's Macintosh Toolbox was the first to provide this type of service, but Windows and Unix environments have long since stepped up to the same standards. Just as Windows programmers don't have to write their own printer drivers anymore, there is no need to try to handle double-byte character I/O "manually."
Thanks to these developments on the OS front, independent developers now have more localization options. Among the most important developments is the agreement among leading software companies on a Unicode character-encoding specification that provides OS-level support for multiple-byte as well as single-byte character sets.
Another key development taking place is that OS giants such as Microsoft and Apple are in the process of further streamlining localization into their overall product-development efforts. For instance, Microsoft is centralizing product development so that localization experts, programmers, and product managers work in tandem rather than in phases.
Apple, currently overhauling its Mac OS strategy, is creating a Software Development Kit (SDK) that will make localization support automatic as opposed to optional.
--U.F.
Stick to the Script
International development efforts usually revolve around text output in the target language's writing system. What do users want and expect? Character codes, fonts, and scripts are good places to find the answers.
A writing system, also called a script, consists of rules for creating a visual representation of language and an accompanying character set. More than 30 character sets are in use today throughout the world, including Chinese, Arabic, Roman (for English), Cyrillic, Japanese, and Hebrew.
These systems differ in their approaches to creating graphic representations of words. In alphabetic scripts such as Roman, Greek, and Cyrillic, the characters typically stand for the basic sounds (i.e., phonemes) of the corresponding language. In syllabic systems (e.g., Japanese kana), characters represent syllables. Complex scripts, such as Chinese Hanzi and Japanese kanji, use up to 30,000 ideographic characters: They stand for sounds and incorporate meanings of words.
Sorting Scripts
From a programmer's technical perspective, you first need to categorize scripts according to their unique characteristics as follows:
- Simple scripts, such as Roman and Cyrillic, use left-to-right lines and fill pages top-to-bottom. With fewer than 256 characters, you can represent each in a single byte. They do not need context information.
- Complex scripts with large character sets (e.g., for the languages of China, Japan, and Korea) need 2 bytes for each character. There usually are no spaces between words. For direction, they use different combinations, including top-to-bottom lines, left-to-right lines, right-to-left pages, and top-to-bottom pages. (Periodicals are usually printed in vertical columns, while technical documents are often displayed in left-to-right lines.) These scripts do not require context information. The characters may have no sorting order that would correspond to alphabetic sorting in the less complex scripts and may also have no uppercase or lowercase.
- Context-sensitive systems will also use fewer than 256 characters, but a character may look different depending on its context (i.e., the surrounding characters). This is similar to handwritten Roman text. An Arabic letter has up to four possible shapes (i.e., glyphs). While typing Arabic text, previously entered characters change in appearance.
- Bidirectional scripts, such as Arabic and Hebrew, use right-to-left as their main direction, but numbers and interspersed words from Roman scripts are written left-to-right. They have fewer than 256 characters. Hebrew is context-independent; Arabic is both bidirectional and context-sensitive.
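Some of these characteristics can be queried at run time rather than hard-coded. The sketch below uses Python's standard unicodedata module to look up a character's directional class and to show how many bytes it occupies in two Unicode encodings. The describe helper is a hypothetical name, and the byte counts in the list above refer to legacy single-byte and double-byte character sets, so they will not match the Unicode figures printed here.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Report a few script-related properties of a single character."""
    return {
        "name": unicodedata.name(ch, "UNKNOWN"),
        # "L" = left-to-right, "R" = right-to-left, "AL" = Arabic letter
        "bidi_class": unicodedata.bidirectional(ch),
        "utf16_bytes": len(ch.encode("utf-16-le")),
        "utf8_bytes": len(ch.encode("utf-8")),
    }


# Roman, Cyrillic, Hebrew, Arabic, and a CJK ideograph
for ch in ["A", "\u042f", "\u05d0", "\u0628", "\u65e5"]:
    print(ch, describe(ch))
```

A layout routine could branch on the directional class instead of on a list of language names, which keeps the script-specific knowledge in the character database.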
Cluster Control
Some scripts, such as Thai, Korean, and Hebrew, have character clusters. This phenomenon, which is similar to but more complex than something such as accents in French or Spanish, requires special consideration from the programmer. Highlighting and deleting text, as well as the movement of the insertion cursor, have to treat the clusters as single characters. Clusters can have up to five components.
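To illustrate what treating a cluster as a single character means for an editing operation, here is a deliberately simplified sketch. It groups each base character with the combining marks that follow it, using Python's unicodedata.combining. This is an assumption-laden approximation: real cluster rules for scripts such as Thai and Korean are more involved, and some marks have a combining class of zero and would be missed. The function names are hypothetical.

```python
import unicodedata


def simple_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplification: only marks with a nonzero combining class are attached.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def delete_last_cluster(text: str) -> str:
    """Backspace that removes a whole cluster rather than one code point."""
    return "".join(simple_clusters(text)[:-1])


word = "cafe\u0301"                    # "e" followed by a combining acute accent
print(simple_clusters(word))
print(delete_last_cluster(word))
```

Cursor movement and selection would step through the same cluster list, so that the insertion point never lands between a base character and its marks.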
Ligatures are a special case of clusters. In Roman scripts, it's usually a sequence of two characters that acts as a unit. To capitalize a ligature, for example at the beginning of a sentence, both characters must change to uppercase.
Boundary Markers
Even if there are no clusters, character and word demarcation -- finding boundaries between words and characters -- can be challenging. For example, even in a seemingly simple Roman text, we do not break lines directly before most punctuation marks; but it is OK to break the line before an opening parenthesis. In bidirectional scripts, this is more complex.
In many Asian systems, there are no word delimiters, so breaking lines or columns requires special algorithms. In Japanese (with no spaces between words), line breaks are allowed anywhere within a word, but you must not split multicharacter symbols.
Word boundaries are often difficult to define. For example, both Feueralarmschalter in German and fire alarm switch in English are essentially compound nouns, but in English, there are spaces between each of the words.
Special typesetting styles (e.g., boldface, italic, and underlining) may not translate well to another writing system. Different cultures also have different conventions for expressing emphasis.
Developers should prepare for all kinds of sorting preferences and strategies. Even in the relatively "safe" realm of Roman scripts, there are variations. In English, sorting is from A to Z. In German, characters with an umlaut sort directly after the character without an umlaut; but in Swedish, Ö comes last in the alphabet. Spanish has double characters (e.g., ll and ch) that sort as single characters.
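The German and Swedish rules just described can be sketched as two sort keys. This is an illustration only, with hypothetical names and hand-built tables; a production application would normally rely on the collation services of the OS or a library instead. Characters outside the Swedish table sort first in this sketch, and the Spanish double characters are not handled.

```python
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"       # å, ä, ö come last
GERMAN_BASE = {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u"}             # ä, ö, ü


def swedish_key(word: str) -> list:
    """Rank each letter by its position in the Swedish alphabet."""
    return [SWEDISH_ALPHABET.find(ch) for ch in word.lower()]


def german_key(word: str) -> list:
    """Sort an umlauted vowel directly after its plain counterpart."""
    return [(GERMAN_BASE.get(ch, ch), ch in GERMAN_BASE) for ch in word.lower()]


words = ["Zebra", "\u00d6l", "Ofen", "Apfel"]
print(sorted(words, key=german_key))    # the Ö word follows "Ofen"
print(sorted(words, key=swedish_key))   # the Ö word comes last
```

The same list of words ends up in a different order under each key, which is why the sort order has to be a property of the locale and not of the program.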
As any Westerner who has tried to look up an expression in a Chinese dictionary knows, sorting complex scripts is difficult. One criterion here is the radical (i.e., root) of the character; another is the number of basic strokes needed to create the character. In these multibyte scripts, multiple character patterns may stand for the same word and should thus sort together. This could require multilevel sort algorithms.
Editing Issues
These categorizations raise a number of issues about editing functions. For example, to convert between uppercase and lowercase, programmers familiar with the standard ASCII character set often add or subtract 32. (This constant is the difference between the numerical ASCII codes for A and a.) But even for the extended character set needed to represent Western European languages, this approach may not work. For example, the difference between Ä and ä may not be 32. The accent or umlaut may also have to disappear in uppercase.
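The following sketch contrasts the subtract-32 trick with a Unicode-aware conversion, using Python's built-in str.upper. The ascii_upper helper is a hypothetical name for the classic approach. IBM code page 437 is used here only as one example of a legacy encoding in which the two forms of the umlauted letter are not 32 apart.

```python
def ascii_upper(ch: str) -> str:
    """The classic trick: subtract 32 from the character code."""
    return chr(ord(ch) - 32)


print(ascii_upper("a"))                    # fine for plain ASCII letters
print(ascii_upper("\u00ff"))               # y with diaeresis becomes the wrong letter
print("\u00ff".upper(), "\u00df".upper())  # Unicode-aware conversion

# Distance between the lowercase and uppercase umlauted a in code page 437
lower_code = "\u00e4".encode("cp437")[0]
upper_code = "\u00c4".encode("cp437")[0]
print(lower_code - upper_code)
```

Note that the Unicode-aware conversion of the German sharp s produces two characters, so code that assumes a string keeps its length after case conversion will also need attention.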
The same challenges for sorting and capitalization apply to search algorithms. String searches must be capable of accepting different character sequences as equivalent. Wild-card characters that are used in Find dialog boxes may have a meaning in the target language and might therefore be unusable.
In a speed-search situation, where you type the first letters of a word, both accented and unaccented occurrences should appear. (If I type a, I expect to get words starting with A, Ä, a, à, and so forth.)
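One way to get this behavior is to compare a folded form of each string, with case and accents removed. The sketch below uses Python's unicodedata.normalize to decompose accented letters and then drops the combining marks. The fold and speed_search names and the sample word list are hypothetical, and stripping every accent is an assumption that will not suit languages in which an accented letter counts as a distinct letter.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase text and strip accents by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def speed_search(prefix: str, words: list[str]) -> list[str]:
    """Return the words whose folded form starts with the folded prefix."""
    key = fold(prefix)
    return [w for w in words if fold(w).startswith(key)]


words = ["Apfel", "\u00c4pfel", "apple", "\u00e0 la carte", "Banane"]
print(speed_search("a", words))
```

The original strings are returned untouched; only the comparison uses the folded form.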
Hyphenation of long words at line breaks is not as easy in other languages as it seems in English. In German, characters in a word may have to change to hyphenate it. French sometimes requires a hyphen if two otherwise separate words extend beyond the end of the line.
Inflection Impact
English-speaking developers should keep in mind that other languages may have much more inflection (i.e., changes to words) depending on tense, case, and gender. While French and other Romance languages have two genders, German has three, and there is one language that has 17. This may, for example, influence the way you write ordinal numbers.
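As a small illustration of the ordinal-number point, the sketch below formats ordinals for English and French. The function names are hypothetical, and only the abbreviated numeric forms are covered. The French formatter needs to know the gender of the noun being counted, which is information an English-only design would never think to pass along.

```python
def ordinal_en(n: int) -> str:
    """English ordinals: 1st, 2nd, 3rd, 4th, with 11th to 13th as exceptions."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def ordinal_fr(n: int, feminine: bool = False) -> str:
    """French ordinals: only 'first' changes with gender (1er or 1re)."""
    if n == 1:
        return "1re" if feminine else "1er"
    return f"{n}e"


for n in (1, 2, 11, 23):
    print(ordinal_en(n), ordinal_fr(n), ordinal_fr(n, feminine=True))
```

Keeping such rules in per-language functions, selected at run time, avoids scattering English suffix logic through the application.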
There is no magic bullet for dealing with the differences among languages. Awareness of the possibilities and flexibility in the code are the most important guidelines. You'll still get surprised occasionally, but not as often as if you just coded your program and dumped it into the Translate-O-Matic.
--U.F.
Avoid the Global Faux Pas
Icons, metaphors, and symbols are some other important issues for developers who need to create software applications that are used internationally. Among the things to watch for are symbols or even colors that may be offensive in other cultures.
In addition, some icons are not easily recognizable across cultures. One example is a mail application with a mailbox icon that raises the red flag when mail has arrived. This type of mailbox is used only in the U.S.
While you must consider these issues carefully, the user interface should still use symbols as much as possible. Many international symbols are widely understood. In Europe, traffic signs are generally symbols with no text, because drivers who do not speak the language will have to understand them.
A design goal of icons is that they should contain no text, because it may not fit in the icon's space when it's translated. If an icon requires a textual explanation, it is better to use Mac-style Bubble Help.
There are a number of books available on this subject. A good one is Global Interface Design by Tony Fernandes (Academic Press, ISBN 0-12-253790-4).
--U.F.
Use Globalization Resources
It's a big planet, but you're not alone. From international standards bodies to companies that specialize in internationalization, you have help in getting your software ready for global markets.
Localization Help
The Localisation Industry Standards Association (LISA; http://www.lisa.unige.ch/) is a private, nonprofit association in Geneva, Switzerland. LISA cooperates with industry partners, providing support for software localization.