Creating or retrofitting software for another country requires attention to myriad technical details involving language, translation, and interface
L. Chris Miller
There is no doubt that the U.S. dominates the world's software production--75 percent of currently installed packaged software worldwide is produced in the U.S. International sales make up half the revenues for the top 100 U.S. software companies.
International sales were responsible for 55 percent of Microsoft's $3.2 billion in revenues for fiscal year 1992. Ken Fowles, manager of global evangelism at Microsoft, reports that there is huge growth potential outside the U.S. for all Windows products. "Windows presently ships in 27 languages. All versions except Japanese will run using the U.S. version of DOS," he says. The Japanese version of Windows requires the Japanese version of DOS to support special hardware issues and incompatibilities specific to Japan.
The international software market is a multibillion dollar opportunity. The Software Publishers Association of Washington, D.C., reports that the four largest international markets for U.S. publishers are the U.K. and Ireland, Germany and Austria, France, and Japan. The U.S. has a 90 percent share of the U.K.'s $1.5 billion software market.
Definitions
Despite the continuing presence of American English versions of software products throughout the world, U.S. software publishers are responding to increasing international demand for software adapted for specific locales. The adaptation process is called localization, which is often abbreviated as l10n, because there are 10 letters between the letter l and the letter n in the word localization.
It takes a great deal of work to localize or retrofit a software package that was designed in American English.
Localizing or retrofitting a product for another locale usually involves reengineering much of the underlying code. This is because software is not a static entity; it must act on data in accordance with the rules or conventions of a given locale. Reengineering the code for each locale is time-consuming and expensive.
Software companies are beginning to realize that tremendous savings and increased revenues are possible if the software is initially designed with features and code that are prepared to accept international conventions, foreign data, and format processing. Designing software that can provide the necessary support for the languages of the intended markets is called internationalization. Building internationalization (abbreviated as i18n) into a product minimizes or eliminates the need for engineering revisions and greatly simplifies the localization process.
Internationalization also reduces the time lags inherent in localizing software for multiple markets. Software developers have become sensitive to the issue of simultaneous release. Quark of Denver, Colorado, recently delayed the shipping date for its QuarkXPress 3.3 desktop publishing program to avoid compatibility problems for end users transferring files from U.S. companies to overseas offices. Multinational companies sharing files can appreciate this gesture. Another advantage to the simultaneous release of software in world markets is that it allows coordinated marketing campaigns.
Internationalization Approaches
Software developers generally use one of three approaches to isolate and protect the core algorithms:
Compile-Time Internationalization. Programmers change the files that contain the source code and algorithms.
Link-Time Internationalization. Programmers extract all the text strings, along with algorithms that are dependent on language or culture, from the program code.
Run-Time Internationalization. End users select locale. The software contains text files for more than one locale.
One interesting example of this last approach is embodied in Compaq Computer's Presario line of computers. The first time you boot up one of these systems, it asks you to choose the language you want to work in. However, if you inadvertently select the wrong one, you may be in trouble: The system automatically erases all supporting files for other languages. These files are the normal MS-DOS and Windows language-support files and could be easily restored from the distribution floppy disks, but Compaq doesn't supply the floppy disks with the machines.
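The run-time approach can be pictured with a minimal Python sketch. The catalog contents, the locale names, and the translate function are all hypothetical and chosen only for illustration; the point is that the text for several locales ships with the product and the choice is made while the program runs.

```python
# Hypothetical message catalogs: one table of interface strings per locale.
CATALOGS = {
    "en_US": {"open": "Open", "quit": "Quit"},
    "de_DE": {"open": "Öffnen", "quit": "Beenden"},
    "fr_FR": {"open": "Ouvrir", "quit": "Quitter"},
}
DEFAULT_LOCALE = "en_US"


def translate(key, locale):
    """Look up an interface string, falling back to the default locale."""
    catalog = CATALOGS.get(locale, CATALOGS[DEFAULT_LOCALE])
    return catalog.get(key, CATALOGS[DEFAULT_LOCALE][key])


print(translate("quit", "de_DE"))   # German catalog is installed
print(translate("quit", "ja_JP"))   # no Japanese catalog: falls back to en_US
```

Because every catalog stays installed in this sketch, a user who picks the wrong language can simply choose again, which is exactly what the Presario setup prevents by erasing the other files.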
Alphabet/Character Encoding and Unicode
Every language has its own alphabet or script; thus, alphabet or character encoding is fundamental to internationalizing software. Early computer standards for alphabet or script encoding made no provision for all the written languages of the world. Since computers manipulate numbers, text must be represented to the computer as a series of numbers, each number corresponding to a graphical representation of a character or glyph.
The most common coded character set is the ASCII 7-bit code set. ASCII supports 128 characters--52 uppercase and lowercase letters in the Roman alphabet and 10 digits, as well as symbols, punctuation, a limited number of accents, and control codes. ASCII, one of the first code sets invented, was designed to support American English. Other coded character sets were created to support other languages.
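The 128-character budget can be checked with a few lines of Python. This is only an illustrative tally using the constants in Python's standard string module; the grouping into five categories is my own, not part of the ASCII standard.

```python
import string

letters = len(string.ascii_letters)   # A-Z and a-z
digits = len(string.digits)           # 0-9
symbols = len(string.punctuation)     # punctuation and symbols, including ^ ` ~
controls = len([c for c in range(128) if c < 32 or c == 127])
space = 1                             # the space character itself

print(letters, digits, symbols, controls, space)
print(letters + digits + symbols + controls + space == 2 ** 7)
```

Seven bits leave no room for the accented letters that most European languages need, which is why the 8-bit sets described next were created.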
Extended ASCII includes accented vowels such as à (a with a grave accent, used in French) and special characters such as the German ß ("double s"). Windows uses a superset of the ANSI character set, essentially ISO 8859/x plus additional characters. The Latin-1 code set uses 8 bits (1 byte) to represent a character, which allows the representation of 256 characters.
Non-Roman languages with more than 256 characters use doublebyte, multibyte, or wide character sets. A doublebyte character uses 16 bits. A multibyte character set can mix single- and multibyte characters. A wide character set typically contains 16- or 32-bit characters.
Code sets differ across operating systems and language scripts. The best-known attempt to consolidate code sets is the Unicode standard. This character code standard, capable of encoding all known written languages, is being touted as the answer to efficient data portability among platforms. The Unicode Consortium, a nonprofit organization in Mountain View, California, was founded in 1991 to develop and promote the use of the Unicode standard. Charter members include Apple, Xerox, IBM, Microsoft, Sun Microsystems, and Novell. The ISO, based in Geneva, Switzerland, approved Unicode in June 1992 as the international character-encoding standard (ISO 10646).
Unicode is a 16-bit code set that can represent up to 65,536 characters. Of these, 34,168 places are defined for most characters used in writing systems, about 6300 places are reserved for software and hardware developers to assign their own characters and symbols, and about 25,000 places are available for expansion. With Unicode, each character is allocated a unique 16-bit value or number. Each 16-bit number is called a code point.
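To see code points in practice, here is a small Python sketch. Python's own Unicode support is a later and broader implementation than the 16-bit design described here, so the sketch sticks to sample characters (my choice) that fit in 16 bits and uses the big-endian UTF-16 codec to show the 2-byte value.

```python
for char in ["A", "à", "ß", "ب", "漢"]:
    code_point = ord(char)              # the character's number
    assert code_point < 2 ** 16         # each sample fits in 16 bits
    print(f"{char}  U+{code_point:04X}  {char.encode('utf-16-be').hex()}")

# The allocation figures quoted above, added up against the 16-bit space.
defined, private, expansion = 34_168, 6_300, 25_000
print(defined + private + expansion, "of", 2 ** 16)
```

The three allocation figures are rounded, so they add up to slightly less than the full 65,536 places.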
Using the Unicode standard eliminates the need for complex modes or escape codes to specify modified characters or special cases. Another advantage is that Unicode has built-in special control characters for handling changes in text direction within a single line of text.
The increasing list of software products implementing Unicode is promising. Yet the cost of converting preexisting software to Unicode-compliant status is still prohibitive for several software developers. Another problem area is program size--if every stored character is 2 bytes long, the software may require a significant amount of additional memory just to run. Thus, for a variety of reasons, it will be some time before international character encoding is standardized across platforms.
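The storage cost is easy to demonstrate. In this Python sketch the sample word is arbitrary; it is encoded once with a 1-byte code set and once as 16-bit characters.

```python
text = "Internationalization"

one_byte = text.encode("latin-1")      # 8 bits per character
two_byte = text.encode("utf-16-be")    # 16 bits per character

print(len(one_byte), len(two_byte))
```

For text that is mostly English, every second byte of the 16-bit form is zero, which is why developers of memory-constrained software see the doubling as pure overhead.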
The Input Problem
So the software can deal with different character sets. But how do you get them into the computer? Keyboard drivers and mapping tables for a variety of code sets can contend with European and Asian languages. For example, French keyboards use the AZERTY layout rather than the QWERTY layout (the French just switched Q and W with A and Z).
Asian languages present the biggest challenge: entering non-Roman characters. The People's Republic of China's Hanzi ideographic script has over 7000 commonly used characters. (An ideographic character script uses pictures or symbols to depict a thing or an idea.) Taiwan's standards require 13,000 characters. The Windows 3.1 version for Taiwan supports six different input methods:
-- Chang Jei--based on a public domain input method. Chinese characters are separated into two or more parts, or radicals. A radical is a part of a Chinese character that you can use to index the character; a character may contain more than one radical, but you can use only one of these as the indexing radical. These radicals are assigned to the keyboard letters a through w and y. The letter x is reserved for complex radicals, and z is used for selecting a duplicate word. Up to five keystrokes may be necessary to generate a single Chinese character.
-- Phonetic--based on a phonetic alphabet (four different keyboard layouts are used). The keyboard includes 37 symbols representing consonants and semivowels and five for audible tones.
-- Quick/simplified--a variation on the Chang Jei method.
-- Internal code--based on the Big-5 internal code, which is an unofficial code page used in Taiwan that contains about 13,000 characters; this is enough for everyday business use but omits many classical Chinese characters.
-- DA-YI--uses 40 defined basic radicals for character composition. This is currently the fastest input method found in Taiwan.
-- Array--10 defined basic keystrokes (numbered 0 through 9). The keyboard is used as a matrix, and the number of basic keystrokes is the index of the matrix. Each Array radical on the keyboard is determined by the first stroke and the last stroke of the radical (i.e., the row index and the column index determine the radical on the matrix of the keyboard).
Software for Japan must support three distinct writing scripts. Text typically contains an average of 55 percent hiragana, 35 percent kanji, and 10 percent katakana. Hiragana uses 46 Japanese symbols (cursive script rather than block letter form) to represent all sound combinations. Kanji refers to more than 7000 Japanese ideographs based on Chinese characters. You can select one of two popular methods for inputting kanji characters: You can enter the hexadecimal representation corresponding to the character, or you can use a katakana-to-kanji conversion. If you type the phonetic spelling of the character, then the phonetic string is translated into the most-likely kanji character. A good converter should select the right choice about 80 percent of the time. When the translation is incorrect, you're presented with phonetically similar ideographs to select from. Katakana consists of 64 phonetic script characters and punctuation, used typically for foreign words. (Note: Hiragana and katakana together are often referred to as kana.) In addition, Arabic numerals and Roman letters are occasionally used in Japan for phonetic spelling of foreign words (these are called romaji).
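The phonetic conversion step can be sketched as a dictionary lookup. Everything in this Python fragment is hypothetical: a real converter uses a large dictionary and grammatical context, and the ordering of candidates here is simply assumed. The words themselves are genuine homophones, which is why the user sometimes has to pick from a list.

```python
# Hypothetical conversion dictionary: phonetic reading -> kanji candidates,
# with the candidate assumed to be most likely listed first.
CANDIDATES = {
    "カンジ": ["漢字", "感じ", "幹事"],
    "ハシ": ["橋", "箸", "端"],
}


def convert(reading):
    """Return (best_guess, alternatives) for a phonetic reading."""
    options = CANDIDATES.get(reading)
    if not options:
        return reading, []          # no entry: leave the kana unchanged
    return options[0], options[1:]


best, others = convert("カンジ")
print(best, others)
```

If the best guess is wrong, the alternatives list is what the user would be shown to choose from.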
A "real" Japanese keyboard commonly has 106 keys (versus 101 for typical U.S. keyboards). The extra keys are for toggling the Windows Input Method Editor, katakana-kanji conversion, and so on. On a U.S. keyboard, these functions are accessed by other key combinations.
Keyboard entry for Asian languages is understandably slow and tedious. Pen technology and handwriting recognition are being explored as possible solutions. There are already six pen-based Chinese character-recognition input devices available for Chinese Windows. Yet some ideographic characters are quite elaborate and require 12 pen strokes. Penkey (Orem, UT) sells a trainable print- and cursive-recognition system called Savant 2.0 that can handle many languages if the fonts are provided. Fonts for ANSI/Latin-1 languages and Japanese katakana and kanji come standard. The Savant 2.0 universal handwriting-recognition system includes built-in Unicode, JIS (Japan Industrial Standard), and ASCII switching.
In the future, voice-recognition technology will probably solve the laborious task of entering Asian languages via a keyboard. Dialects such as Mandarin (with 37 basic sounds, each with four possible inflections) are distinctly and carefully pronounced. Voice-recognition software is already dealing with intonation, pitch, inflection, stress, and pauses. The end user can modify pronunciation rules and exception dictionaries.
Fonts
Arabic is a calligraphic cursive script with 28 letters, 10 numerals, and several special alphanumeric characters. Each Arabic letter has up to four possible shapes based on its position in a word: isolated, initial, medial, and final (i.e., alone, first, middle, and last). The software is expected to analyze the letter's position in a word and change the letter shape accordingly. About 250 characters are necessary to produce good-quality text. A DTP (desktop publishing) program might include up to 900 characters. One interesting feature of Arabic is that, while you are typing, previously entered letters will be changing shape.
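The position analysis can be sketched in Python using the shape tables that today's Unicode character database carries for Arabic. This is a deliberately simplified illustration: it chooses a shape from position alone, exactly as described above, and ignores the fact that some letters (alef, for example) never connect to the letter that follows them. The shape function and FORMS table are names invented for this sketch.

```python
import unicodedata

# Build (base letter, position tag) -> shaped form from the Unicode data for
# the Arabic Presentation Forms-B block (U+FE70..U+FEFF).
FORMS = {}
for cp in range(0xFE70, 0xFF00):
    parts = unicodedata.decomposition(chr(cp)).split()
    if len(parts) == 2:                       # e.g. ['<initial>', '0628']
        tag, base = parts
        FORMS[(chr(int(base, 16)), tag)] = chr(cp)


def shape(word):
    """Pick a shape for each letter purely from its position in the word."""
    shaped = []
    for i, letter in enumerate(word):
        if len(word) == 1:
            tag = "<isolated>"
        elif i == 0:
            tag = "<initial>"
        elif i == len(word) - 1:
            tag = "<final>"
        else:
            tag = "<medial>"
        shaped.append(FORMS.get((letter, tag), letter))
    return "".join(shaped)


BEH = "\u0628"   # the Arabic letter beh
for typed in (BEH, BEH * 2, BEH * 3):
    print([f"U+{ord(c):04X}" for c in shape(typed)])
```

Running the loop mimics typing the same letter three times: the first letter starts out isolated, becomes initial when a second letter arrives, and the second letter changes from final to medial when the third is typed.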
Sorting Sequences
Computer software for international markets must be able to implement various sorting algorithms. Each locale has its own sorting-order preference for uppercase and lowercase letters, double characters, accented vowels versus nonaccented vowels, and numerals.
In the U.S., for example, the sorting preference is from a to z; but in Denmark, there are letters after z. In Latin America, the double character ch is treated as a single character and is placed after c and before d. In Germany, "o with umlaut" sorts with the letter o; however, in Sweden, "o with umlaut" is the last letter of the alphabet.
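These rules can be expressed as locale-specific sort keys. The Python sketch below is a minimal illustration with hand-built alphabets; make_key and the three key functions are hypothetical names, the alphabets cover only the cases mentioned above (the Spanish ñ, for instance, is left out), and a production system would rely on the platform's collation services instead.

```python
def make_key(alphabet, fold=None):
    """Build a sort key from an ordered list of collation units.

    Units may be longer than one letter (e.g. 'ch'); `fold` maps a letter
    onto another letter before ranking (e.g. 'ö' onto 'o').
    """
    rank = {unit: i for i, unit in enumerate(alphabet)}
    longest = max(len(unit) for unit in alphabet)
    fold = fold or {}

    def key(word):
        text = "".join(fold.get(c, c) for c in word.lower())
        ranks, i = [], 0
        while i < len(text):
            for size in range(longest, 0, -1):   # prefer the longest unit
                unit = text[i:i + size]
                if unit in rank:
                    ranks.append(rank[unit])
                    i += len(unit)
                    break
            else:                                # unknown character: sort last
                ranks.append(len(rank) + ord(text[i]))
                i += 1
        return ranks

    return key


LATIN = list("abcdefghijklmnopqrstuvwxyz")
german_key = make_key(LATIN, fold={"ö": "o"})            # ö sorts with o
swedish_key = make_key(LATIN + ["å", "ä", "ö"])          # ö is the last letter
spanish_key = make_key(list("abc") + ["ch"] + LATIN[3:])  # ch between c and d

words = ["zebra", "öl", "ost"]
print(sorted(words, key=german_key))
print(sorted(words, key=swedish_key))
print(sorted(["dado", "chico", "cuna"], key=spanish_key))
```

The same three words come out in a different order under the German and Swedish keys, which is why sorting cannot be hard-coded into the core of an international product.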
Writing Direction
Most Western languages are written in a Roman script horizontally from left to right and continue this pattern from top to bottom. Arabic and Hebrew characters are written horizontally from right to left, from top to bottom--but with numerals going from left to right on the same line. Traditional Japanese characters were written vertically from top to bottom, from left to right. The Japanese language is now also written horizontally from left to right.
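Software can discover a character's direction from its character properties. The Python sketch below reads the bidirectional class that the Unicode character database assigns to each character; the DIRECTION table is my own simplified grouping of those classes for illustration, not a full layout algorithm.

```python
import unicodedata

# Simplified grouping of Unicode bidirectional classes (an assumption made
# for this sketch; the full standard defines many more classes).
DIRECTION = {
    "L": "left-to-right letter",
    "R": "right-to-left letter",
    "AL": "right-to-left letter",
    "EN": "numeral (runs left to right)",
    "AN": "numeral (runs left to right)",
}


def direction_of(char):
    return DIRECTION.get(unicodedata.bidirectional(char), "neutral or other")


# Latin A, Hebrew alef, Arabic beh, European 7, Arabic-Indic 7
for char in ["A", "\u05D0", "\u0628", "7", "\u0667"]:
    print(f"U+{ord(char):04X}", unicodedata.bidirectional(char), direction_of(char))
```

A layout routine would use these classes to split a line into runs, drawing letter runs right to left while keeping each run of numerals in left-to-right order.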
Interfaces and Menus
You must allow enough space for text expansion when translating from English into another language. The Microsoft Windows SDK (Software Development Kit) recommends allowing 200 percent extra space for 1 to 10 English characters, 100 percent extra space for 11 to 20 English characters, and 30 percent extra space for 71 or more English characters. For example, the Preferences selection from the Windows menu would translate as Bildschirmeinstellungen in German. Boxes containing text should be self-sizing and movable.
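The allowances quoted above translate directly into a field-width calculation. In this Python sketch the function names are hypothetical, and lengths of 21 to 70 characters deliberately return no figure because none is quoted here.

```python
def extra_space_percent(english_length):
    """Return the recommended extra space, or None where no figure is quoted."""
    if 1 <= english_length <= 10:
        return 200
    if 11 <= english_length <= 20:
        return 100
    if english_length >= 71:
        return 30
    return None   # lengths 21-70: no figure quoted above


def field_width(english_text):
    """Width in characters to reserve for a translated version of the text."""
    percent = extra_space_percent(len(english_text))
    if percent is None:
        raise ValueError("no allowance quoted for this length")
    return len(english_text) * (100 + percent) // 100


print(field_width("Preferences"))            # 11 characters plus 100 percent
print(len("Bildschirmeinstellungen"))
```

Preferences is 11 characters, so the guideline reserves 22, while the German term runs to 23. The guideline is a starting point, and the one-character shortfall shows why self-sizing boxes are still worth the effort.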
Remember that software may not run (or run properly) if text files do not strictly meet certain technical requirements, such as character-length restrictions, string files, line links, command prompts, and other source code variables. Internal calls to related files will fail if you change filenames. Terminology consistency is crucial, since there can be hundreds of cross-references between the interface, the documentation, the text files, and the filenames.
Translating Documentation
Special care must be taken with documentation translation. William Saiff, a technical writer in the Washington, D.C., metropolitan area, found an overwhelming need for guidelines to aid in the translation of technical and marketing materials. "Anything you do to make your information as clear and simple as possible promotes easier translation," he advises. "Avoid using English words with multiple meanings. For example, use because instead of since. Because has a single meaning; since can be confusing for a translator."
Terminology used in documentation should correspond to terminology in the software. Creating glossaries is important for maintaining consistent terminology. Microsoft publishes the GUI Guide--International Terminology for the Windows Interface, which includes standard translations for 14 European languages.
Translation of Text Strings
Leave room on the disks or plan for extra disks to accommodate the increased length due to text expansion when translating from English into many non-English languages. File sizes of localized software will often be larger than the original English files. Also, be careful that file compression and decompression routines work properly with extended characters.
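A round-trip test is a simple way to check a compression routine against extended characters. This Python sketch uses the standard zlib module as a stand-in for whatever routine a product actually ships with, and the sample string is arbitrary.

```python
import zlib

original = "Größe: 5 Dateien; voilà, façade"
encoded = original.encode("latin-1")          # one byte per character

# Compress, decompress, and decode again; nothing should change on the way.
restored = zlib.decompress(zlib.compress(encoded)).decode("latin-1")
print(restored == original)
```

Routines that assume 7-bit text tend to fail this kind of test by stripping or altering the high bit of each accented character.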
Translation and Terminology Tools
Many translation and terminology tools can accelerate the localization process. GlobalWare (Los Angeles, CA) offers three products for managing the translation process. These tools extract text from source code, formatting and hypertext codes, and document files. The translated text is then automatically returned to the correct locations in the source file.
XL8 Code extracts text from programming code (e.g., C, C++, Windows resource files, Macintosh resource files, Lisp, and Pascal) by using filters or code-definition files. You can select the platform (e.g., Unix, DOS, Windows, Next, or OS/2) and the character-code page. XL8 Help processes Windows help files, while XL8 Text processes document files. Two important features are included in the three XL8 products: The glossary tool set lets you create a glossary and attach it to a file, and the leverage feature applies previous translations to the current file.
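The extract-and-return cycle can be illustrated with a few lines of Python. This is not how the XL8 products work internally; it is a minimal sketch that assumes C-style double-quoted strings and ignores comments and character literals. The function names and the sample source are invented for the example.

```python
import re

# Matches a double-quoted string literal, allowing backslash escapes inside.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')


def extract_strings(source):
    """Return the text of every string literal, in order of appearance."""
    return [match.group(1) for match in STRING_LITERAL.finditer(source)]


def reinsert(source, translations):
    """Put translated text back in place; untranslated strings are kept."""
    def swap(match):
        text = match.group(1)
        return '"' + translations.get(text, text) + '"'
    return STRING_LITERAL.sub(swap, source)


source = 'printf("Open file");\nprintf("Quit");'
print(extract_strings(source))
print(reinsert(source, {"Quit": "Beenden"}))
```

The translator sees only the extracted list, so the surrounding code cannot be damaged by accident during translation.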
MCB Systems (San Diego, CA) markets the respected Trados line of translation tools. The Trados Translator's Workbench II is a translation editor that simultaneously accesses two databases: a terminology database, used to build custom glossaries, and a translation memory database that stores entire sentences as they are translated. This approach uses fuzzy logic to access previously translated terms and sentences, helping language professionals translate more efficiently and consistently. A tag recognition feature lets you localize files from various DTP programs, as well as from Windows Help and resource files. The Trados Translator's Workbench for Windows, expected later this year, will add a memory functionality to Word for Windows and WordPerfect for Windows.
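The idea of a translation memory with approximate matching can be sketched in Python. The similarity measure from the standard difflib module is only a stand-in for whatever matching Trados actually uses, and the memory contents, the lookup function, and the 0.8 cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical translation memory: English sentence -> stored German translation.
MEMORY = {
    "Save the file.": "Speichern Sie die Datei.",
    "Close the window.": "Schließen Sie das Fenster.",
}


def lookup(sentence, cutoff=0.8):
    """Return (matched_source, translation) for the closest stored sentence."""
    matches = difflib.get_close_matches(sentence, list(MEMORY), n=1, cutoff=cutoff)
    if not matches:
        return None
    return matches[0], MEMORY[matches[0]]


print(lookup("Save the files."))      # near match: reuse and adjust
print(lookup("Print the report."))    # nothing close enough: translate afresh
```

A near match gives the translator a starting point to edit, which is where the gains in speed and consistency come from.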
WorldScript
Apple's WorldScript technology provides built-in enabling for most written languages. WorldScript was released as part of the System 7.1 operating system for the Macintosh in October 1992.
System-software support for language scripts is streamlined with WorldScript. Each language script affects components such as character encoding, keyboard layout, input methods, sorting, formats, and fonts. Tables in the system resources specify script behavior, while WorldScript I and WorldScript II extensions do the processing. WorldScript eliminates independent development for scripts. Routines for 1-byte languages are included with WorldScript I. WorldScript II extensions provide routines for 2-byte languages. Support is included for right-to-left scripts, vertical scripts, in-line conversion, and third-party front-end processors.
Using the set of APIs along with a supporting language module eliminates the need for a specific foreign-language operating system. Programmers can now write their software for a single operating system. Mac applications can be quickly localized, even for non-Roman languages. Apple intends to support Unicode in future releases of its system software.
Multilingual Software
The software industry is looking toward creating software designed to serve several markets without the need for localization. Such software is sometimes referred to as multilingual software. An early example was the 8/24 GC graphics card driver developed at Apple. The driver included all the strings needed for 14 languages in a single piece of software, automatically configuring itself for the language of the user by checking for the language in which the operating system was running.
Gamma Productions (Santa Monica, CA) offers a complete Unicode-compliant multilingual word processing and font system for over 100 languages. Universe for Windows 1.04 permits mixing any combination of languages supported. It is even possible to mix vertical and horizontal scripts (e.g., Chinese and English) with correct rotation of characters and punctuation. Checking the spelling of multiple languages is possible in one pass.
Tools to Support Software Migration
The Gamma Server for Unicode is a 32-bit DLL that provides a systemwide set of resources for integrating multilingual processing in any Windows 3.1 or NT application for single-byte and dual-byte languages specified by Unicode. By using Gamma's API, an application can access a host of language-based services, including a wide variety of keyboard layouts; methods for converting user input to Unicode data; text services for determining character properties, contextual analysis, ligatures and font mappings; plus spelling checker, hyphenation, and thesaurus support.
Multilingual Computing Magazine and Buyer's Guide (Sandpoint, ID) offers information on the technical aspects of internationalization. The magazine covers new technologies, services for companies interested in localization, reviews of software tools and publications, and a calendar of events. Also, it has an extensive list of multilingual computing products.
The promise of expanding international markets with generally higher profit margins is motivating U.S. software developers to internationalize or localize their products. Developers considering foreign markets need to know that much more than translation is involved. Two excellent reference books that offer detailed overviews are Global Software: Developing Applications for the International Market by Dave Taylor (Springer-Verlag, 1992) and Software Internationalization and Localization: An Introduction by Emmanuel Uren, Robert Howard, and Tiziana Perinotti (Van Nostrand Reinhold, 1993). U.S. software companies will continue to maintain or gain a competitive advantage by adapting their products to meet the unique requirements of foreign countries.
L. Chris Miller, a computer consultant in Washington, D.C., has been involved in machine translation and software localization for many years. You can reach her on the Internet at 70303.314@compuserve.com or on BIX c/o "editors."