
Global from Day One


March 1997 / Features / Global from Day One

Internationalizing code from the start minimizes costs and leads to big payoffs.

Holly Hubbard Preston and Udo Flohr

In the same way that little software companies often become big ones, domestically oriented software companies often become internationally competitive. The problem is that while many independent software companies like to think big, they do not always like to think international. There's a good reason for this. It's costly to localize and translate a software application -- and its related documentation -- for an overseas market.

Berlitz Translation Services, a division of the international language-services giant, estimates that its software-developer clientele pay anywhere from $50,000 to $100,000 or more to fully localize and translate a product for a single language. Microsoft places its translation and localization costs at as much as $300,000 or more, depending on complexity.

The process is also time-consuming. Companies spend an average of six months to a year to prepare a localized-language version of software. This includes bringing in localization experts to check for technical, cultural, and linguistic accuracy.

Money Makes Money

If it is such a hassle, why bother? Companies do it because selling localized-language versions is big business. The Software Publishers Association (Washington, DC) estimates that 60 percent of the more than $28.7 billion in packaged business software sold in 1995 was sold outside the U.S.

For many software companies, a global product strategy is akin to planning a building in which you will house your company five years from now. While you might not need the opportunity today, will you tomorrow? As Orlando Ayala, Microsoft's vice president in charge of intercontinental sales and marketing, notes: "There is a very direct relationship between growth and the ability to produce localized versions." Over 50 percent of the company's fiscal 1996 revenues of more than $8 billion came from outside the U.S.

Although tempted by the opportunity to make money offshore, many independent software developers opt to focus solely on creating an English-language version of their software. It is only when that software is selling strongly in the U.S. market that they start to think about localizing their product for other markets. While such an approach may save money in the near term, it can prove to be costly over the long haul. If an English-language application was not written with a foreign-language port in mind, the developer will have to not only localize the software but also rewrite a significant portion of its code. Meanwhile, your better-prepared competition beats you.

Preparing for World Markets

Thanks to key developments in OS software, in the evolution of Unicode (see the article "Unicode Evolves"), and in other areas, independent developers now have more opportunities to prepare their products for localization long before they even think of stepping out of the country or their home state.

While support for Unicode in OSes and other developments make it easier for an independent software company to plan for future localization, they by no means ensure success. It is critical that developers lay some groundwork of their own at the code level to ensure that their product is fully Unicode-compliant and has enough extensibility to sell overseas. Here's what the specialists at Berlitz Translation Services say developers need to be aware of when writing English-language versions of their product:

1. Be sure your application is Unicode-enabled, meaning that it can accommodate single-byte (e.g., Arabic, Hebrew, and Cyrillic), double-byte (e.g., Chinese and Japanese), and Roman character sets.

2. Understand exactly what the OS you intend to write for supports as far as Unicode features. Different versions of the OS may have done different things with Unicode.

3. Do not assume that your printer and display drivers can accommodate Unicode in final output. For instance, some scripts are displayed from right to left or from the bottom of the screen to the top.

4. Avoid hard-coding certain standard items such as input fields, fixed character width, dialog boxes, and help definitions. Your application will need flexibility. Foreign languages tend to take 20 percent more space than English.

5. Avoid string concatenation. Separate strings of precoded phrases don't come together the same way in one language that they do in others. For example, stringing together the local string for "file," the local string for "error," and the local string for "has occurred" may not give the local string for "file error has occurred."

The key point in following these steps is to make sure that, as a developer, you are aware of what you need, API-wise, to go global long before you do. Developers who don't follow these steps will have a difficult time localizing their code for different parts of the world. There are too many idiosyncrasies from one country to the next to be ready to localize a product on an ad hoc basis.
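The advice against concatenation can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the CATALOGS table, the message() helper, and the German sentence are all hypothetical stand-ins for strings that would normally live in per-language resource files. The point is that each language supplies one whole-sentence template with a named placeholder, so the translator controls the word order.

```python
# Hypothetical message catalogs. In a real product these would be loaded
# from per-language resource files rather than written into the source.
CATALOGS = {
    "en": {"file_error": "A file error has occurred in {name}."},
    "de": {"file_error": "In {name} ist ein Dateifehler aufgetreten."},
}


def message(lang, key, **params):
    """Look up a whole-sentence template and fill in its named placeholders."""
    english = CATALOGS["en"]
    catalog = CATALOGS.get(lang, english)      # unknown language: use English
    template = catalog.get(key, english[key])  # missing key: use English text
    return template.format(**params)


print(message("en", "file_error", name="report.txt"))
print(message("de", "file_error", name="report.txt"))
```

Note that the placeholder sits at the end of the English sentence but near the start of the German one, which is exactly what run-time concatenation of fragments cannot express.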

Your globalization chores will become easier if you focus on the areas mentioned in the following sections. --H.H.P.


Get Help from the OS

Developers should make their application as culturally and human-language independent as possible. The OS can help them. Rather than writing quick-and-dirty language-dependent code first and then rewriting it later, programmers should use the routines provided as part of the OS for tasks such as text input and output. Apple's Macintosh Toolbox was the first to provide this type of service, but Windows and Unix environments have long since stepped up to the same standards. Just as Windows programmers don't have to write their own printer drivers anymore, there is no need to try to handle double-byte character I/O "manually."

Thanks to these developments on the OS front, independent developers now have more localization options. Among the most important developments is the agreement between leading software companies for a Unicode character-encoding specification that provides OS-level support to multiple-byte as well as single-byte character sets.

Another key development taking place is that OS giants such as Microsoft and Apple are in the process of further streamlining localization into their overall product-development efforts. For instance, Microsoft is centralizing product development so that localization experts, programmers, and product managers work in tandem rather than in phases.

Apple, currently overhauling its Mac OS strategy, is creating a Software Development Kit (SDK) that will make localization support automatic as opposed to optional. --U.F.


Stick to the Script

International development efforts usually revolve around text output in the target language's writing system. What do users want and expect? Character codes, fonts, and scripts are good places to find the answers.

A writing system, also called a script, consists of rules for creating a visual representation of language and an accompanying character set. More than 30 character sets are in use today throughout the world, including Chinese, Arabic, Roman (for English), Cyrillic, Japanese, and Hebrew.

These systems differ in their approaches for creating graphic representation of words. In alphabetic scripts such as Roman, Greek, and Cyrillic, the characters typically stand for the basic sounds (i.e., phonemes) of the corresponding language. In syllabic systems (e.g., Japanese kana), characters represent syllables. Complex scripts, such as Chinese Hanzi and Japanese kanji, use up to 30,000 ideographic characters: They stand for sounds and incorporate meanings of words.

Sorting Scripts

From a programmer's technical perspective, you first need to categorize scripts according to their unique characteristics as follows:

  • Simple scripts, such as Roman and Cyrillic, use left-to-right lines and fill pages top-to-bottom. With fewer than 256 characters, you can represent each in a single byte. They do not need context information.
  • Complex scripts with large character sets (e.g., for the languages of China, Japan, and Korea) need 2 bytes for each character. There usually are no spaces between words. For direction, they use different combinations, including top-to-bottom lines, left-to-right lines, right-to-left pages, and top-to-bottom pages. (Periodicals are usually printed in vertical columns, while technical documents are often displayed in left-to-right lines.) These scripts do not require context information. The characters may have no sorting order that would correspond to alphabetic sorting in the less complex scripts and may also have no uppercase or lowercase.
  • Context-sensitive systems will also use fewer than 256 characters, but a character may look different depending on its context (i.e., the surrounding characters). This is similar to handwritten Roman text. An Arabic letter has up to four possible shapes (i.e., glyphs). While typing Arabic text, previously entered characters change in appearance.
  • Bidirectional scripts, such as Arabic and Hebrew, use right-to-left as their main direction, but numbers and interspersed words from Roman scripts are written left-to-right. They have fewer than 256 characters. Hebrew is context-independent; Arabic is both bidirectional and context-sensitive.
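The categories above can be probed with Python's standard library. The sketch below is only an illustration under stated assumptions: the sample characters and the legacy codecs paired with them are choices made here, not part of the original text. For each sample it reports how many bytes the character needs in a script-specific legacy encoding and which bidirectional class the Unicode character database assigns to it (L for left-to-right, R and AL for right-to-left scripts).

```python
import unicodedata

# (sample character, legacy codec) pairs chosen purely for illustration.
SAMPLES = [
    ("A", "latin-1"),         # Roman capital A
    ("\u042f", "iso8859_5"),  # Cyrillic capital YA
    ("\u65e5", "shift_jis"),  # Japanese kanji for "day"
    ("\u05d0", "iso8859_8"),  # Hebrew letter alef
    ("\u0628", "iso8859_6"),  # Arabic letter beh
]


def describe(ch, codec):
    """Return (bytes needed in the legacy codec, Unicode bidi class)."""
    return len(ch.encode(codec)), unicodedata.bidirectional(ch)


for ch, codec in SAMPLES:
    size, bidi = describe(ch, codec)
    print(f"U+{ord(ch):04X}  {codec:10}  {size} byte(s)  bidi class {bidi}")
```

A program that branches on properties like these, rather than on assumptions about a particular language, is easier to extend to a new script later.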

Cluster Control

Some scripts, such as Thai, Korean, and Hebrew, have character clusters. This phenomenon, which is similar to but more complex than something such as accents in French or Spanish, requires special consideration from the programmer. Highlighting and deleting text, as well as the movement of the insertion cursor, have to treat the clusters as single characters. Clusters can have up to five components.

Ligatures are a special case of clusters. In Roman scripts, it's usually a sequence of two characters that acts as a unit. To capitalize a ligature, for example at the beginning of a sentence, both characters must change to uppercase.
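To make the cluster requirement concrete, here is a deliberately simplified Python sketch. It assumes that a cluster is a base character followed by any number of Unicode combining marks; real scripts need richer rules than this, so treat clusters() and delete_last_cluster() as hypothetical helpers rather than a complete segmentation algorithm.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    result = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and result:
            result[-1] += ch       # attach the mark to the preceding base
        else:
            result.append(ch)      # start a new cluster
    return result


def delete_last_cluster(text):
    """What a Backspace key should remove: one whole cluster, not one code."""
    return "".join(clusters(text)[:-1])


thai = "\u0e01\u0e34\u0e48"        # Thai consonant + vowel sign + tone mark
print(len(thai), len(clusters(thai)))
print(repr(delete_last_cluster("ab" + thai)))
```

The same grouping would drive cursor movement and highlighting, so that the insertion point never lands in the middle of a cluster.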

Boundary Markers

Even if there are no clusters, character and word demarcation -- finding boundaries between words and characters -- can be challenging. For example, even in a seemingly simple Roman text, we do not break lines directly before most punctuation marks; but it is OK to break the line before an opening parenthesis. In bidirectional scripts, this is more complex.

In many Asian systems, there are no word delimiters, so breaking lines or columns requires special algorithms. In Japanese (with no spaces between words), line breaks are allowed anywhere within a word, but you must not split multicharacter symbols.
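As a rough sketch of such an algorithm, the Python function below breaks text that has no spaces at any position, except that it never lets a line begin with a punctuation mark. The NO_LINE_START set and the break_lines() name are assumptions made for this example, the width is counted in characters rather than display cells, and a production line breaker would apply many more rules.

```python
# Characters that should not begin a line (an illustrative, incomplete set).
NO_LINE_START = set("、。，．」』）！？")


def break_lines(text, width):
    """Split space-less text into lines of at most `width` characters."""
    lines = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        # If the next line would start with forbidden punctuation, end this
        # line one character earlier so the punctuation keeps its neighbor.
        while end < len(text) and end > start + 1 and text[end] in NO_LINE_START:
            end -= 1
        lines.append(text[start:end])
        start = end
    return lines


sample = "これは日本語です。次の文です。"
for line in break_lines(sample, 8):
    print(line)
```

Here the first line is cut after seven characters instead of eight, because the eighth position would leave the full stop stranded at the start of the second line.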

Word boundaries are often difficult to define. For example, both Feueralarmschalter in German and fire alarm switch in English are essentially compound nouns, but in English, there are spaces between each of the words.

Special typesetting styles (e.g., boldface, italic, and underlining) may not translate well to another writing system. Different cultures also have different conventions for expressing emphasis.

Developers should prepare for all kinds of sorting preferences and strategies. Even in the relatively "safe" realm of Roman scripts, there are variations. In English, sorting is from A to Z. In German, characters with an umlaut sort directly after the character without an umlaut; but in Swedish, Ö comes last in the alphabet. Spanish has double characters (e.g., ll and ch) that sort as single characters.
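The German and Swedish rules just described can be expressed as two different sort keys over the same word list. The Python sketch below is a simplification for illustration only: german_key() and swedish_key() are hypothetical helpers, the Swedish alphabet string is an assumption of this example, and a shipping product would rely on the collation services of the OS or a library instead.

```python
import unicodedata


def german_key(word):
    """Sort umlauted letters with their base letter, plain spelling first."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, word)


# Swedish treats a-ring, a-umlaut, and o-umlaut as letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"


def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    ranks = []
    for ch in unicodedata.normalize("NFC", word.casefold()):
        pos = SWEDISH_ALPHABET.find(ch)
        # Unknown characters sort after the alphabet, in code order.
        ranks.append(pos if pos >= 0 else len(SWEDISH_ALPHABET) + ord(ch))
    return ranks


words = ["Zebra", "\u00d6l", "Ofen"]
print(sorted(words, key=german_key))
print(sorted(words, key=swedish_key))
```

The same three words come out in a different order under each key, which is why the comparison function has to be selected per locale rather than built into the application.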

As any Westerner who has tried to look up an expression in a Chinese dictionary knows, sorting complex scripts is difficult. One criterion here is the radical (i.e., root) of the character; another is the number of basic strokes needed to create the character. In these multibyte scripts, multiple character patterns may stand for the same word and should thus sort together. This could require multilevel sort algorithms.

Editing Issues

These categorizations raise a number of issues about editing functions. For example, to convert between uppercase and lowercase, programmers familiar with the standard ASCII character set often add or subtract 32. (This constant is the difference between the numerical ASCII codes for A and a.) But even for the extended character set needed to represent Western European languages, this approach may not work. For example, the difference between Ä and ä may not be 32. The accent or umlaut may also have to disappear in uppercase.
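A short Python experiment shows where the add-or-subtract-32 habit goes wrong. The function name ascii_style_upper() is invented for this sketch, and code page 437 is used only as one example of an extended character set in which the two forms of a letter are not 32 apart.

```python
def ascii_style_upper(ch):
    """The classic ASCII trick: subtract 32 from the character code."""
    return chr(ord(ch) - 32)


# Works for the 26 unaccented letters.
print(ascii_style_upper("a"))

# Fails for y-with-diaeresis (U+00FF): the trick yields sharp s (U+00DF),
# while the real uppercase form lives at U+0178.
print(ascii_style_upper("\u00ff"), "\u00ff".upper())

# Uppercasing can even change the length of a string.
print("\u00df".upper())

# In code page 437 the codes for A-umlaut and a-umlaut differ by 10, not 32.
upper_code = "\u00c4".encode("cp437")[0]
lower_code = "\u00e4".encode("cp437")[0]
print(upper_code - lower_code)
```

Calling the case-conversion routine supplied by the language or OS avoids all three surprises.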

The same challenges for sorting and capitalization apply to search algorithms. String searches must be capable of accepting different character sequences as equivalent. Wild-card characters that are used in Find dialog boxes may have a meaning in the target language and might therefore be unusable.

In a speed-search situation, where you type the first letters of a word, both accented and unaccented occurrences should appear. (If I type a, I expect to get words starting with A, Ä, a, à, and so forth.)
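One way to get this speed-search behavior is to compare folded forms of the strings rather than the strings themselves. The Python sketch below assumes that stripping combining accents and ignoring case is the desired equivalence, which will not suit every language; fold() and speed_search() are names made up for the example.

```python
import unicodedata


def fold(text):
    """Reduce text to an accent-free, case-free form for loose matching."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.casefold()


def speed_search(prefix, words):
    """Return the words whose folded form starts with the folded prefix."""
    wanted = fold(prefix)
    return [w for w in words if fold(w).startswith(wanted)]


entries = ["Apfel", "\u00c4rger", "\u00e0 la carte", "Banane"]
print(speed_search("a", entries))
```

The original spelling is kept for display; only the comparison uses the folded form.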

Hyphenation of long words at line breaks is not as easy in other languages as it seems in English. In German, characters in a word may have to change to hyphenate it. French sometimes requires a hyphen if two otherwise separate words extend beyond the end of the line.

Inflection Impact

English-speaking developers should keep in mind that other languages may have much more inflection (i.e., changes to words) depending on tense, case, and gender. While French and other Romance languages have two genders, German has three, and there is one language that has 17. This may, for example, influence the way you write ordinal numbers.
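Ordinal numbers make a compact example. The Python sketch below covers only abbreviated ordinals in three languages and is an assumption-laden illustration: the ordinal() function and its feminine parameter are hypothetical, and spelled-out ordinals or other languages would need far more grammatical information.

```python
def ordinal(n, lang, feminine=False):
    """Format n as an abbreviated ordinal in English, French, or German."""
    if lang == "en":
        if 10 <= n % 100 <= 20:
            suffix = "th"                      # 11th, 12th, 13th, ...
        else:
            suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
        return f"{n}{suffix}"
    if lang == "fr":
        if n == 1:
            return "1re" if feminine else "1er"  # gender matters for "first"
        return f"{n}e"
    if lang == "de":
        return f"{n}."                         # German marks ordinals with a period
    raise ValueError(f"no ordinal rule for language {lang!r}")


for number in (1, 2, 3, 12, 22):
    print(ordinal(number, "en"), ordinal(number, "fr"), ordinal(number, "de"))
print(ordinal(1, "fr", feminine=True))
```

Because the French form depends on the gender of the noun being counted, the caller has to pass in information that an English-only design would never have collected.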

There is no magic bullet for dealing with the differences among languages. Awareness of the possibilities and flexibility in the code are the most important guidelines. You'll still get surprised occasionally, but not as often as if you just coded your program and dumped it into the Translate-O-Matic. --U.F.


Avoid the Global Faux Pas

Icons, metaphors, and symbols are some other important issues for developers who need to create software applications that are used internationally. Among the things to watch for are symbols or even colors that may be offensive in other cultures.

In addition, some icons are not easily recognizable across cultures. One example is a mail application with a mailbox icon that raises the red flag when mail has arrived. This type of mailbox is used only in the U.S.

While you must consider these issues carefully, the user interface should still use symbols as much as possible. Many international symbols are widely understood. In Europe, traffic signs are generally symbols with no text, because drivers who do not speak the language will have to understand them.

A design goal of icons is that they should contain no text, because it may not fit in the icon's space when it's translated. If an icon requires a textual explanation, it is better to use Mac-style Bubble Help.

There are a number of books available on this subject. A good one is Global Interface Design by Tony Fernandes (Academic Press, ISBN 0-12-253790-4). --U.F.


Use Globalization Resources

It's a big planet, but you're not alone. From international standards bodies to companies that specialize in internationalization, you have help in getting your software ready for global markets.

Localization Help

The Localisation Industry Standards Association (LISA; http://www.lisa.unige.ch/) is a private, nonprofit association in Geneva, Switzerland. LISA cooperates with industry partners, providing support for software localization.

Componentized Unicode

Gamma UniVerse (from Gamma Productions, which is a developer of cross-platform foreign-language products that are based on the Unicode standard) is an ActiveX control that brings Unicode support to existing applications and OSes, potentially covering more than 175 languages. It enables developers to write and maintain global applications for the borderless world of the Internet/intranet.

Used as a stand-alone multilingual editor application for Windows or embedded in applications supporting any OS, Gamma UniVerse lets applications written in C++ or Visual Basic transparently support the Unicode character-encoding system. It is compatible with Microsoft's ActiveX cross-platform standards.

Browsers with an Accent

Accent Software International is one of the best-known players in the multilingual-software arena. Its Internet With an Accent is a multilingual Web viewer and publisher. It includes four components. First, a multilingual version of the Mosaic Web browser interprets a Web page's characters. Users select one of over 30 languages and character sets from a menu in the toolbar. A free Netscape plug-in is also available.

Second, a Hypertext Markup Language (HTML) editor takes the hassle out of writing multilingual Web pages. The third component is a viewer for documents created with Accent's word processor. Finally, it includes MailPad, a multilingual e-mail application.

Tango Browser, from Alis, also displays Web pages in dozens of languages. With Tango Browser, you can select its interface language, automatically retrieve pages in the language you prefer, and input text in a wide variety of languages. --U.F.


Where to Find


Berlitz Translation Services
Santa Monica, CA
Phone:    (310) 260-7100
Phone:    (310) 260-7185
Internet: http://www.berlitz.com


Gamma Productions, Inc.
San Diego, CA
Phone:    (619) 794-6399
Fax:      (619) 794-7294
Internet: http://www.gammapro.com


Unicode Consortium
San Jose, CA
Phone:    (408) 777-3721
Fax:      (408) 777-0405
Internet: http://unicode.inc@unicode.org
 



Coding Practices That Make Localization Easier

  • Use Unicode for all character processing.
  • For any non-Unicode data (e.g., fonts and code page numbers), do not hard-code.
  • Isolate code functions that require script-specific modification.
  • Avoid hard-coding user-visible strings. Use resource files instead.
  • Avoid run-time concatenation.



Globalization Glossary


Character code:
 A unique integer value that signifies a character in
a script.


Character orientation:
 The rotation of the characters in relation to
the script's line orientation. It is called with-stream when it goes
in the same direction as the line orientation and cross-stream
otherwise. Horizontal line orientations (left-to-right or
right-to-left) usually lead to with-stream character orientation.
Vertical line orientation (top-to-bottom) yields cross-stream most of
the time, but with-stream (i.e., a vertical character baseline, a
90-degree rotation) is also possible.


Encoding:
 The mapping between characters (i.e., the character set)
and their character codes, which are unique integers. Sometimes a
character appears in different character sets or more than once in
the same character set; this is because the same character can be
used differently in another script. An example is the character H,
which is used differently in Roman than in Cyrillic.


Glyph:
 The shape of a character. In context-sensitive scripts, it
depends on the surrounding characters.


I18n, L10n:
 Even in an acronym-laden industry such as this, these two
have to be among the crankiest. I18n stands for internationalization;
the 18 signifies the number of characters between the i and the n.
L10n is, similarly, localization. Many people argue that cryptic
acronyms such as these defy the very purpose of globalization.


Internationalization:
 Preparing a product for international markets
while retaining it as a single version; that is, it does not yet
contain features that apply to only one language or script. Thorough
internationalization, as a first step, saves cost in the next stage,
which is localization.


Line orientation:
 The text flow direction within a line. For
instance, Arabic generally uses right-to-left line orientation. In
Japanese, it is either top-to-bottom or left-to-right.


Localization:
 Preparing a product for a single locale. This usually
involves translating the user interface and documentation and
adapting time, date, and number formats. However, it often doesn't
stop there. The different script may require more dramatic changes,
and sometimes icons, symbols, metaphors, and even concepts have to be
reconsidered.


Script:
 A writing system for creating a visual representation of
language. It consists of a set of characters and rules on how to
combine them. Examples of scripts are Chinese, Arabic, and Roman.




Six World Views

Even this small sample shows the variety of ways numbers, dates,
times, and currencies are represented in different countries.

Country   Digit separators   Currency            Time                Short date
U.S.      1,234.56           $0.23 ($0.45)       9:05AM 11:20PM      12/22/96 2/1/96
U.K.      1,234.56           £0.23 (£0.45)       09:05 23:20         22/12/96 1/02/96
Germany   1.234,56           0,23 DM 0,45 -DM    09:05Uhr 23:20Uhr   22.12.1996 1.02.1996
France    1 234,56           F 0,23 -F 0,45      09:05 23:20         22.12.1996 1.02.1996
Greece    1 234,56           Dr 0.23 Dr (0.45)   09:05 23:20         22-12-96 1-02-96
Japan     1 234.56           ¥0.23 (¥0.45)       09:05AM 11:20PM     96.12.22 96.2.1
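Conventions like these are best kept in data rather than in code. The Python sketch below stores the digit separators and short-date patterns from the table above in a dictionary and formats values from it. The FORMATS table layout and the two helper functions are hypothetical, and a real application would normally ask the OS for the user's locale settings instead of carrying its own table.

```python
from datetime import date

# Digit separators and short-date patterns taken from the table above.
FORMATS = {
    "U.S.":    {"group": ",", "decimal": ".", "date": "{m}/{d}/{yy:02d}"},
    "U.K.":    {"group": ",", "decimal": ".", "date": "{d}/{m:02d}/{yy:02d}"},
    "Germany": {"group": ".", "decimal": ",", "date": "{d}.{m:02d}.{yyyy}"},
    "France":  {"group": " ", "decimal": ",", "date": "{d}.{m:02d}.{yyyy}"},
    "Greece":  {"group": " ", "decimal": ",", "date": "{d}-{m:02d}-{yy:02d}"},
    "Japan":   {"group": " ", "decimal": ".", "date": "{yy:02d}.{m}.{d}"},
}


def format_number(value, country):
    """Format a number with two decimals using the country's separators."""
    fmt = FORMATS[country]
    text = f"{value:,.2f}"  # U.S.-style text such as "1,234.56"
    swap = str.maketrans({",": fmt["group"], ".": fmt["decimal"]})
    return text.translate(swap)


def short_date(day, country):
    """Format a date using the country's short-date pattern."""
    pattern = FORMATS[country]["date"]
    return pattern.format(d=day.day, m=day.month,
                          yy=day.year % 100, yyyy=day.year)


for country in FORMATS:
    print(f"{country:8} {format_number(1234.56, country):10} "
          f"{short_date(date(1996, 2, 1), country)}")
```

Adding a seventh country then means adding one row of data, with no change to the formatting code.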




Illustration: Split Personalities


Holly Hubbard Preston is a freelance writer based in Palo Alto, California, covering computer markets around the globe. Her work has appeared in trade and mainstream publications both here and abroad. You can reach her at 71021.1641@compuserve.com. Udo Flohr is a BYTE contributing editor based in Hannover, Germany. To reach him on the Internet, send e-mail to flohr@dfn.de.
