

March 1996 / Blasts from the Past / 10 Years Ago in BYTE

Kurzweil meant quality then...still means quality today.

Raymond Kurzweil, founder and chairman of Kurzweil Applied Intelligence, wrote about speaker-independent, voice-activated word processors that let you write by talking. Initial price was $20,000. Now that PCs are powerful enough to tackle the daunting task of speech recognition, such systems are available for under $1000. And as Rob Dieterich's article "I'll Talk To You Soon" indicates, prices will get even lower.


The theme for the March 1986 issue was "Homebound Computing," of which Kurzweil's article "The Technology of the Kurzweil Voice Writer" was an integral part. Not a product review, this article explored voice technology.

The Technology of the Kurzweil Voice Writer -- The present office system provides a clue to future applications for the deaf

by Raymond Kurzweil

Editor's note: This article is not a review of the KVW; it is a look at a technology that may be available on personal computers in the future.

The Kurzweil Voice Writer (KVW) is a voice-activated word processor with a relatively unrestricted user-specific vocabulary. The system starts with a vocabulary of at least 5000 frequently used words in the English language. It subsequently adds the words you use that are not part of its initial vocabulary and eventually deletes those words that you never use. Total vocabulary, depending on the KVW model, will be in the 7500- to 20,000-word range.

Voice is our most effective and rapid means of communication, and the ability to interact with computerized information services and devices by voice, without the restrictions of artificial vocabularies or syntax, is expected to be of major benefit. The primary application of the KVW is to automate the creation of written text, which is a fundamental activity in the office. Combining large-vocabulary ASR (automatic speech recognition) with natural-language understanding would also enable professionals and executives to make inquiries of database-management systems or management information systems verbally in natural language instead of through a keyboard.

One planned application of this technology is to create a speaker-independent version of the KVW to serve as a display telephone for the deaf. This would enable a deaf person to hold a phone conversation without being restricted to speaking to other deaf people who have compatible TDDs (telecommunications device for the deaf). It is not yet available but the technology that will be used in its creation is described in essence in this article.

The KVW as it currently exists requires only that you can speak and that you can see. Motion and hearing impairments are not obstacles in its operation. The current version of the KVW is for the business community, but it fills a need for many disabled persons as well. The initial KVW model, which can be shared by multiple users (one at a time), is expected to be introduced this year at a price under $20,000. Future models of both single-user and multi-user systems are expected to be in the $4000 to $10,000 range. While this is beyond the price range of most individuals, the technology is the clue to future, more individually affordable solutions.

Large-Vocabulary ASR

There are two difficulties involved in creating large-vocabulary ASR. First, you must create a set of linguistic and speech-recognition algorithms that provide the requisite recognition power and that are capable of resolving the fine distinctions and ambiguities that are inevitable when you deal with a large natural vocabulary. The incidence of "perplex clusters" (words that differ by only one phonetic feature) is much higher for a natural vocabulary than for an artificially created command vocabulary. Indeed, many words do not differ in sound at all (homonyms), but can be differentiated only by context. For example, if you want to recognize "To be or not to be, that is the question," you must deal with the first six words, each of which represents a perplex homonym set: (to, too, two, 2); (be, bee, b); (or, oar); (not, knot); (to, too, two, 2); (be, bee, b). Of the 576 possible phrases, all are acoustically correct, but only one is linguistically correct.
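The figure of 576 is simply the product of the sizes of the six homonym sets. The short Python sketch below checks that arithmetic and enumerates the candidate phrases. The set contents come from the example above; the function names are illustrative only and are not part of the KVW.

```python
from itertools import product
from math import prod

# Homonym sets for the first six words, copied from the example in the text.
HOMONYM_SETS = [
    ("to", "too", "two", "2"),
    ("be", "bee", "b"),
    ("or", "oar"),
    ("not", "knot"),
    ("to", "too", "two", "2"),
    ("be", "bee", "b"),
]


def count_phrases(sets):
    """Number of acoustically identical phrases: the product of the set sizes."""
    return prod(len(s) for s in sets)


def enumerate_phrases(sets):
    """Yield every candidate phrase as a space-joined string."""
    for combo in product(*sets):
        yield " ".join(combo)


phrase_count = count_phrases(HOMONYM_SETS)  # 4 * 3 * 2 * 2 * 4 * 3
```

An acoustic recognizer alone cannot rank any of these candidates above the others, which is why context is needed to pick "to be or not to be" from the list.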

Second, you must provide the necessary computing power. Running the algorithms for the KVW on a sequential computer of Motorola 68000 power requires over an hour per word. One reason that the algorithms require this amount of computation is to provide the very high degree of precision needed to deal with the perplexity of a large natural vocabulary. Significant computation is also required to perform the transformations and property-extraction algorithms required to deal with the numerous sources of speech variance that such a system is subject to. Parallel processing provides the speed improvement of several thousandfold necessary to achieve a real-time response time of 250 milliseconds.

The KVW architecture incorporates multiple microprocessors and uses dedicated implementations of specific algorithms in custom VLSI (very-large-scale integration) and discrete circuits. This significantly increases the effective computation throughput levels. A current industry trend finds parallel arrays of dedicated implementations of algorithms in custom VLSI replacing the conventional architecture of a single programmable processor with its one memory space, software, and appropriate peripherals.

Vocabulary

One type of information that adapts as you use the KVW is the active vocabulary. The system starts out with a vocabulary of at least 5000 common words in the English language. The first time you use a particular word that is not in this starting vocabulary, the system won't be able to recognize it, and you will have to either type it in or verbally spell it in. This process is required only the first time you use a new word; the system will add the word to the active vocabulary and should subsequently be able to recognize it when you use it again.

Words continue to be added until the vocabulary reaches its maximum size, which will vary depending on the model. (The vocabulary size required will vary from user to user. It is expected that user-specific vocabularies in the 7500- to 20,000-word range will ultimately be provided.) After this, new words continue to be added, but the system must drop words from the original set that you have never used. The final result is a vocabulary that should cover the vast majority of words that you use.
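The adaptation policy described in this section can be sketched in a few lines of Python. This is only an illustration of the stated behavior (add unknown words, and once the vocabulary is full, drop starting words that have never been used). The class and method names are hypothetical, and the article does not say what happens when no unused starting word remains, so the sketch simply raises an error in that case.

```python
class AdaptiveVocabulary:
    """Toy model of a user-adaptive active vocabulary (names are hypothetical)."""

    def __init__(self, starting_words, max_size):
        self.starting = set(starting_words)
        # Maps each active word to the number of times the user has spoken it.
        self.use_counts = {word: 0 for word in starting_words}
        self.max_size = max_size

    def recognize(self, word):
        """Return True and count the use if the word is in the active vocabulary."""
        if word in self.use_counts:
            self.use_counts[word] += 1
            return True
        return False

    def add(self, word):
        """Add a typed or spelled word, evicting a never-used starting word if full."""
        if word in self.use_counts:
            return
        if len(self.use_counts) >= self.max_size:
            victim = next(
                (w for w, n in self.use_counts.items()
                 if n == 0 and w in self.starting),
                None,
            )
            if victim is None:
                # Assumption: the article does not describe this case.
                raise RuntimeError("no never-used starting word left to drop")
            del self.use_counts[victim]
        self.use_counts[word] = 1


vocab = AdaptiveVocabulary(["the", "of", "zebra"], max_size=3)
vocab.recognize("the")          # known word, use is counted
if not vocab.recognize("kurzweil"):
    vocab.add("kurzweil")       # first use: typed or spelled in by the user
```

After the last line the vocabulary still holds three words, but a never-used starting word has given way to the new one.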

Multiple Experts

Rather than select a single technique such as Markov modeling, dynamic time warping, robust feature analysis, or high-level feature extraction, the KVW technology incorporates multiple experts, each of which uses a somewhat different approach to the problem of large-vocabulary speech recognition. Different approaches to a complex pattern-recognition task such as ASR have different strengths and weaknesses, and a system that incorporates a variety of techniques is likely to provide better performance than a system that relies on a single method.

Some of the experts run in real time on conventional 68000 microprocessors, while others require specialized parallel circuitry to provide real-time performance. In this specialized circuitry 68000 microprocessors provide function control and sequencing, while the circuitry acts like peripherals to them. The resulting architecture consists of multiple 68000s, each with its own RAM (random-access read/write memory) space, plus specialized circuitry incorporating additional RAM spaces.

To take maximum advantage of a multiple-expert strategy you must combine the results from each expert in a way that recognizes its unique strengths and weaknesses. In general, the system can quickly and accurately resolve each recognition within a small perplex set of words. After this initial cut of the vocabulary to a small set (ranging from one word to a few dozen), the expert-management techniques depend to a great extent on the nature of the resulting perplex set. Some of the expert-management techniques are knowledge-based. For example, the handling of homonym sets is done through a single expert that is capable of differentiating between homonyms based on context. Other techniques involve probability: the methods of combining the probabilities from each expert are controlled by statistics on how the various experts have performed for different types of perplex sets. Some of these parameters are derived from statistics gathered during a particular user's time on the system and thus form part of the overall user-adaptation process.
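The article does not give the formula used to combine the experts' probabilities. One common way to do this kind of pooling is a weighted log-linear combination, sketched below in Python as an assumption rather than as the KVW's actual method. The expert names, probabilities, and weights are made up for illustration; in a real system the weights would come from statistics on how each expert has performed.

```python
import math


def combine_experts(expert_probs, expert_weights, floor=1e-9):
    """Weighted log-linear pooling of per-expert word probabilities.

    expert_probs:   {expert_name: {word: probability}}
    expert_weights: {expert_name: weight}; a larger weight means more trust
    Returns a normalized {word: probability} over words scored by every expert.
    """
    candidates = set.intersection(*(set(p) for p in expert_probs.values()))
    log_scores = {
        word: sum(
            expert_weights[name] * math.log(max(probs[word], floor))
            for name, probs in expert_probs.items()
        )
        for word in candidates
    }
    peak = max(log_scores.values())  # subtract the peak for numerical stability
    unnormalized = {w: math.exp(s - peak) for w, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {w: v / total for w, v in unnormalized.items()}


# Hypothetical scores for a homonym set: the acoustic expert cannot separate
# the words, while the language expert strongly prefers one of them.
example_probs = {
    "acoustic": {"to": 0.34, "too": 0.33, "two": 0.33},
    "language": {"to": 0.90, "too": 0.05, "two": 0.05},
}
example_weights = {"acoustic": 1.0, "language": 1.0}
combined = combine_experts(example_probs, example_weights)
best_word = max(combined, key=combined.get)
```

Setting an expert's weight to zero removes its influence entirely, which is one simple way a manager could discount an expert that has performed poorly on a given type of perplex set.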

Language Experts

A number of experts try to predict the likelihood of different words occurring at a particular lexical entry point based entirely on context. These experts use a variety of information-theory as well as sentence-parsing techniques.

The sentence-parsing expert is similar to the type of parser used in some natural-language understanding programs in that a tree-like structure is generated showing the part of speech of each word and its relationship to other words in the sentence. One significant difference is that the KVW parser is able to generate parses on incomplete sentences. At a particular point in a dictated sentence, we have only the "left" part of the sentence (from the beginning up through and not including the current word). Based on each parse of the incomplete sentences as they come in, the parsing expert is able to assign probabilities to different parts of speech. Rather than the eight or nine basic parts of speech that grade school children are familiar with (noun, verb, adjective, etc.), the KVW parser uses approximately 200 types representing subcategories of the basic parts of speech. This degree of specificity enables the parsing expert to increase the value of its predictions.

Using a lexicon of approximately 50,000 words that indicates the likelihood of different parts of speech for each word, the parsing expert is able to assess the likelihood of different words. In particular, the parsing expert is good at eliminating choices that are syntactically unlikely.

There is a fortunate orthogonality between the strengths of the acoustic experts and those of the language experts. For example, most homonyms represent significantly different syntactic types that can be determined from context. "Two," "to," and "too" represent very different grammatical categories with readily identifiable word contexts. Also, short function words, which tend to be more difficult for an acoustic recognizer, are actually easier for the language model to make predictions for.
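As a rough illustration of how part-of-speech predictions can separate homonyms, the Python sketch below scores each candidate word by summing, over syntactic types, the probability of that type given the left context times the likelihood that the word belongs to that type. The scoring rule, the type names, and all of the numbers are assumptions made for this example; the article does not describe the parser's internals at this level.

```python
def score_words(type_probs, lexicon, candidates):
    """Score candidates by sum over types of P(type | context) * P(type | word).

    type_probs: {syntactic_type: probability predicted from the left context}
    lexicon:    {word: {syntactic_type: likelihood of that type for the word}}
    """
    return {
        word: sum(
            type_probs.get(word_type, 0.0) * likelihood
            for word_type, likelihood in lexicon[word].items()
        )
        for word in candidates
    }


# Hypothetical lexicon entries for one homonym set.
toy_lexicon = {
    "to": {"infinitive_marker": 0.5, "preposition": 0.5},
    "two": {"number": 1.0},
    "too": {"adverb": 1.0},
}

# Hypothetical parser output for the slot after the left context "I want".
predicted_types = {
    "infinitive_marker": 0.60,
    "preposition": 0.20,
    "number": 0.15,
    "adverb": 0.05,
}

scores = score_words(predicted_types, toy_lexicon, ["to", "two", "too"])
preferred = max(scores, key=scores.get)
```

The three words sound the same, yet their scores differ widely because each belongs to a different syntactic type.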

Acoustic Experts

The acoustic experts share an acoustic front-end processor that includes a high-resolution digitization (over 96-decibel dynamic range) and a robust filter bank made up of several hundred two-pole filter elements with 24-bit accuracy. The resulting spectral data is subsequently processed through a series of normalizations and other transformations to reduce variability and preserve feature invariance. Some of the transformations are based on an auditory model similar in many ways to the human ear's auditory front-end processing.
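For readers unfamiliar with two-pole filters, the Python sketch below implements a textbook two-pole resonator and a tiny filter bank that measures the energy in each channel. It is a generic illustration, not the KSC2408 algorithm: the pole radius, sample rate, and center frequencies are arbitrary assumptions, and the arithmetic uses ordinary floating point rather than 24-bit fixed point.

```python
import math


def two_pole_filter(samples, center_hz, sample_rate_hz, pole_radius=0.98):
    """Textbook resonator: y[n] = x[n] + 2r*cos(w)*y[n-1] - r*r*y[n-2]."""
    w = 2.0 * math.pi * center_hz / sample_rate_hz
    a1 = 2.0 * pole_radius * math.cos(w)
    a2 = -pole_radius * pole_radius
    y1 = y2 = 0.0  # the two previous outputs
    out = []
    for x in samples:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def channel_energies(samples, centers_hz, sample_rate_hz):
    """Run one resonator per channel and return the output energy of each."""
    return [
        sum(y * y for y in two_pole_filter(samples, center, sample_rate_hz))
        for center in centers_hz
    ]


def sine(freq_hz, sample_rate_hz, count):
    """Generate `count` samples of a unit-amplitude sine wave."""
    return [
        math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz)
        for i in range(count)
    ]


# A 1000 Hz test tone should excite the 1000 Hz channel far more than the others.
tone = sine(1000.0, 8000.0, 800)
energies = channel_energies(tone, [500.0, 1000.0, 3000.0], 8000.0)
```

Each output sample costs only a few multiplications and additions, but the work is repeated for every sample in every channel, which is why a bank of several hundred such filters is a natural candidate for dedicated hardware.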

The acoustic experts utilize a RAM storage of word models, which are updated after each utterance. The acoustic experts are capable of evaluating the likelihood of every word model for a given test token, although the expert manager may request that a particular acoustic expert evaluate only a subset of the models based on the results of earlier experts.

Parallel-Processing Architecture

One area that uses extensive parallel processing is the front-end filtering. In order to make the fine distinctions necessary to handle the perplexity of a large vocabulary, a great deal of accuracy and resolution is needed in the number of filter channels and the accuracy of both the sample stream and the filters. Filtering is handled by the KSC2408 filter chip (from Kurzweil Semiconductor, a division of Kurzweil Applied Intelligence Inc.) with several two-pole filters used for each filter channel. Implementing the 2408's filter algorithm (for a single two-pole filter) on a 68000 requires five seconds to process one second of speech, or five times real time. Each KSC2408 chip includes eight such filters (which operate in real time) and is thus equivalent to forty 68000 microprocessors (for the 2408 filter algorithm). The current Model I KVW uses 25 KSC2408 chips, which is equivalent to using a thousand 68000 microprocessors for the filtering operation.
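The equivalence figures in this paragraph follow from simple multiplication, which the Python snippet below spells out. The three input numbers are taken from the text; the function and constant names are illustrative only.

```python
# Figures quoted in the text.
CPU_SECONDS_PER_SPEECH_SECOND = 5  # one two-pole filter on one 68000
FILTERS_PER_CHIP = 8               # real-time filters on each KSC2408
CHIPS_IN_MODEL_I = 25              # KSC2408 chips in the Model I KVW


def equivalent_68000s(filter_count, cpu_seconds_per_speech_second):
    """68000s needed to run `filter_count` filters in real time."""
    return filter_count * cpu_seconds_per_speech_second


per_chip = equivalent_68000s(FILTERS_PER_CHIP, CPU_SECONDS_PER_SPEECH_SECOND)
total_filters = FILTERS_PER_CHIP * CHIPS_IN_MODEL_I
whole_machine = equivalent_68000s(total_filters, CPU_SECONDS_PER_SPEECH_SECOND)
```

Here `per_chip` works out to forty and `whole_machine` to a thousand, matching the figures above, with 200 two-pole filters running in total.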

The equivalent of several thousand additional 68000 microprocessors (for certain dedicated algorithms, not for general-purpose computation) is provided by other special circuits used in the acoustic-matching process. The language experts and elements of the acoustic-recognition process such as normalization and other transformations are handled by multiple conventional microprocessors.

Using the KVW

In dictation mode, you simply speak your text in a rapid, discrete manner, with brief pauses between words. The pause required between words is adjustable and should be set just long enough to reduce or eliminate the ambiguity between word pauses and stopgaps within a word. In general, this figure ranges from 100 to 250 milliseconds. The system responds within 500 ms after the end of each word by displaying the recognized word on the screen. A special status line displays any alternate word choices. In trials of the KVW when the system has chosen the wrong word, the correct word has usually been the first or second alternate given.
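The role of the adjustable pause can be illustrated with a small Python sketch that splits a stream of speech/silence frames into words. A silence shorter than the pause setting is treated as a stopgap inside a word; a longer one ends the word. The frame-based representation, the 10 ms frame length, and the function name are assumptions made for this example and are not taken from the KVW.

```python
import math


def segment_words(frame_is_speech, frame_ms, min_pause_ms):
    """Return (start, end) frame spans, end exclusive, for each detected word."""
    min_pause_frames = math.ceil(min_pause_ms / frame_ms)
    words = []
    start = None        # first frame of the word in progress, if any
    last_speech = None  # most recent speech frame of that word
    silence_run = 0     # consecutive silent frames seen inside the word
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_pause_frames:
                # The silence is long enough to count as a pause between words.
                words.append((start, last_speech + 1))
                start = None
                silence_run = 0
    if start is not None:
        words.append((start, last_speech + 1))
    return words


# 10 ms frames: 200 ms of speech, a 50 ms stopgap, 100 ms of speech,
# a 200 ms pause, then 150 ms of speech.
frames = [True] * 20 + [False] * 5 + [True] * 10 + [False] * 20 + [True] * 15
two_words = segment_words(frames, frame_ms=10, min_pause_ms=150)
three_words = segment_words(frames, frame_ms=10, min_pause_ms=40)
```

With a 150 ms setting the stopgap stays inside the first word; with a 40 ms setting the same stopgap is mistaken for a word boundary, which is the ambiguity the adjustment is meant to avoid.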

The basic mode of operation is to speak into the system and watch the text appear. You don't need to be aware of what is in the active vocabulary. You simply speak and let the vocabulary-adaptation process proceed automatically.

You can also enter commands by voice. To distinguish commands from text, you enter a command mode either by depressing a function key or by speaking an appropriate unique verbal "Enter command mode" instruction (for example, "blix"). Once you enter command mode, you can switch among different types of commands to go, for example, from application-program commands to operating-system commands.

The primary mode of integrating the KVW's capabilities with an application program is through "transparent" integration. In this mode, the KVW simulates the keyboard. Recognized text and commands are converted into appropriate character strings and transmitted to the operating system as if they came from the keyboard. The character strings come in through a special serial line, and an appropriate driver intercepts them and presents them to the operating system as having come from the keyboard.

User Interface

One user interface that has been proposed for ease of use includes a pointing device (such as a mouse) to control the cursor, which is not easily manipulated by either keystrokes or verbal commands. The mouse would have two buttons, one to toggle between text and command mode and the second to correct errors. Again, you would have the choice of using these two buttons or using verbal commands. You would have relatively little use for the keyboard. Being able to correct most errors, go back and forth between text and commands, and control the location of the cursor would provide most of the control necessary aside from the actual verbal dictation of the text and commands.

To take this concept one step further, you could combine a flat-panel display with a touch-sensitive surface to provide a "pad" that you would hold in your lap or on your desk. As you speak to the pad, words would appear on its surface display. To control the cursor for insertion, deletion, or replacement operations, you would simply point to the screen. The two basic functions of error correction and toggle-to-command mode would be provided by either displayed "buttons," real buttons, or voice command (at your option). For the occasional requirement to type in a new vocabulary word, a QWERTY keyboard could be displayed on the screen.

Physical Configuration

The KVW consists of an approximately 100-megabyte Winchester disk, four circuit boards, and a power supply in a standard rack-mountable cabinet. While it would be possible to sit the KVW server next to the workstation it serves, it is generally found in a separate location. Thus, you interact only with your workstation and a microphone. The microphone can be either head-mounted, worn on your lapel, or desk-mounted. It is connected to a small box that digitizes the signal and transmits it on a high-speed serial line.

Future Directions

Future applications of the KVW technology include integration with natural-language-understanding systems, domain-specific expert systems, text-to-speech synthesizers, and a variety of application packages to provide executive assistants that are powerful and easy to use. Such systems will have access to the internal databases and MIS (management information system) information of the user's own organization as well as public, semipublic, and restricted-access databases accessed by telecommunications. Professionals, executives, students, and others will be able to converse with such systems to conduct rapid research and inquiry into a variety of questions of interest. Such questions might involve information retrieval ("How did the sales in our Western region for the past quarter compare to those of our three largest competitors?") as well as substantive analysis ("Which financing option for the proposed capital acquisition is best supported by our current balance sheet?"). Questions would be asked by voice in natural language. The questions would be clarified through two-way voice communication (or display), and final answers would be provided by either voice, display, or printout, as appropriate.

The acoustic experts in the KVW are adaptable to continuous speech input. The computation requirements must be increased to handle connected speech, as must the recognition power requirements to handle the additional perplexity of word segmentation, interword coarticulation, and function word reduction. It is expected that economically viable systems that can handle continuous speech will follow discrete-word KVWs within a few years.

The KVW techniques are also adaptable to European languages. The acoustic experts require very little change. The principal changes necessary to the language experts are (1) to provide the appropriate grammar rules to the parsing expert (although the parsing-expert algorithms themselves don't require substantial change) and (2) to train the language experts on appropriate foreign-language text. Foreign-language KVWs will probably follow the English KVW within a few years.

Handling Japanese requires more work than do European languages such as French or German. While Japanese has only about 120 syllables (compared to around 10,000 in English), the syllable set is a perplex one, with many syllable pairs being distinguished only by the duration of the vowel. Also, the differences in Japanese syntax require modifying more than just the parsing expert's grammar rules. Most of the KVW's techniques are, however, appropriate to the language, and a Japanese machine is feasible.

A number of configurations of a speech-to-display sensory aid for the deaf using the KVW technology have been proposed, which the company plans to pursue. Alternatives range from a speaker-independent version of the KVW (with an increased error rate) to a system that displays phonetic transcriptions rather than words. Such a phonetic transcription would contain some insertion, deletion, and substitution errors but could be understood by the user with appropriate training.

Conclusion

The introduction of large-vocabulary ASR is expected to provide dramatic productivity gains in creating written text, an optimal mode of communication between persons and intelligent computerized devices and services for information retrieval and analysis, as well as improved understanding and communication for the deaf population.


March 1986



Raymond Kurzweil is the founder and chairman of Kurzweil Computer Products, Kurzweil Music Systems Inc., and Kurzweil Applied Intelligence Inc. He received a B.S. degree from MIT and an honorary Ph.D. from Hofstra University. He can be reached at 411 Waverly Oaks Rd., Waltham, MA 02154.

Copyright © 2005 CMP Media LLC