
Talking to Machines


December 1995 / Features / Talking to Machines

The ultimate computer input device may be right under your nose

Judith Markowitz

You don't see the crew of the Starship Enterprise fussing much with keyboards. When someone wants to ask the computer a question, he or she normally just speaks to it. It's fast, efficient, and natural -- in Star Trek's twenty-fourth-century universe. In our own time, however, keyboards and mice are a lot more important for computing than our voices are.

But talking to machines is too good an idea to ignore. Speech is such a basic and universal mode of communication that it's natural to want to talk to machines, such as computers and telephones.

What Did You Say?

Whether we use it to dial a telephone, navigate through Windows, dictate a letter, or enter data, speech recognition's basic job remains the same: to identify what a person has said, and to do so quickly, accurately, and seamlessly. It has to identify features from a continuous blast of speech and noise that spans the entire spectrum of audible frequencies.

The task is complicated by regional accents and speech habits (see the table "Wuzzatdoonear? Idano"). We rarely notice them because we use nonverbal and situational cues to help us. Speech-recognition systems depend almost exclusively on acoustic data, yet we still expect them to perform as accurately as we do.

Several features influence the accuracy and speed of a speech-recognition system: the recognition algorithms used, the size and nature of the vocabulary, the grammar, whether speech is continuous or discrete, and the speaker model. These are summarized in the table "Speech-Recognition Features".

I Recognize that Algorithm

Speech-recognition systems compare stored vocabulary models with spoken input according to specified recognition algorithms. No match will be exact, because slight differences in speed, emphasis, emotion, and other details change a word's acoustic patterns and length, even with just one speaker.

Speech-recognition systems represent words in different ways. Some systems use templates, which encode acoustic patterns from one or more samples and then compare acoustic patterns with spoken input, frame by frame. Most products, however, use hidden Markov models, or HMMs (see the sidebar "Hidden Markov Models"). Two recognition algorithms commonly used with HMMs are the Baum-Welch maximum likelihood ("best match") algorithm and the Viterbi ("best path") algorithm. Both process the input through an HMM and produce a probability rating.
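To make the difference between a "best match" score and a "best path" score concrete, here is a minimal sketch on a toy two-state HMM. Every state name, symbol, and probability is invented for illustration and comes from no real product. The first function sums the probability of the input over all state paths (the forward computation that Baum-Welch re-estimation builds on); the second keeps only the single most likely path, as the Viterbi algorithm does.

```python
# Toy discrete HMM. All states, symbols, and probabilities are made up
# for illustration; a real recognizer works on acoustic feature vectors.
STATES = ("s1", "s2")
START = {"s1": 0.8, "s2": 0.2}
TRANS = {"s1": {"s1": 0.6, "s2": 0.4},
         "s2": {"s1": 0.1, "s2": 0.9}}
EMIT = {"s1": {"a": 0.7, "b": 0.3},
        "s2": {"a": 0.2, "b": 0.8}}


def forward_likelihood(obs):
    """Total probability of obs, summed over every possible state path."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            s: sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s][symbol]
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(obs):
    """Probability and state sequence of the single most likely path."""
    best = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for symbol in obs[1:]:
        step = {}
        for s in STATES:
            # Pick the best predecessor state for s.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][symbol], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


if __name__ == "__main__":
    print(forward_likelihood("abb"))
    print(viterbi("abb"))
```

A recognizer would run a computation like this against the model for each candidate word and prefer the model with the highest score. The best-path score can never exceed the summed score, because it is one term of that sum.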

HMMs are fast, efficient, and accurate, but the industry and the technology are evolving rapidly, and developers are investigating alternative approaches. One option, auditory modeling, attempts to reproduce operations of the inner ear and auditory nerve. Test systems have shown improved accuracy, speaker modeling, and noise rejection. Unfortunately, human auditory behavior is poorly understood, and full auditory models remain a long-range goal.

Artificial neural networks have begun to appear in commercial speech recognition. Neural nets can extract complex patterns from large quantities of messy data, which makes them well-suited for speech.

How Many Words Is Enough?

Even dictionaries with 100,000 words can't meet all needs, so some products allow users and developers to add words. The newest tools create new vocabulary items by combining HMMs. A system could construct the word "unbirthday," for example, by assembling HMMs for each of its sounds. Or it could extract "un" from "unbolt" and attach it to "birthday."
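The idea of building a new word from existing sound models can be sketched as simple concatenation. The unit labels, the three-states-per-unit layout, and the two-entry lexicon below are all assumptions made for the example, not the inventory of any actual product.

```python
# Hypothetical sub-word inventory: each sound unit has its own small
# left-to-right model, represented here only by a state count.
UNITS = ["ah", "n", "b", "ow", "l", "t", "er", "th", "d", "ey"]
UNIT_STATES = {unit: 3 for unit in UNITS}

# Hypothetical pronunciations for two words already in the vocabulary.
LEXICON = {
    "unbolt": ["ah", "n", "b", "ow", "l", "t"],
    "birthday": ["b", "er", "th", "d", "ey"],
}


def build_word_model(units):
    """Chain the unit models end to end into one left-to-right state list."""
    states = []
    for unit in units:
        states.extend(f"{unit}_{i}" for i in range(UNIT_STATES[unit]))
    return states


def add_word(lexicon, word, units):
    """Register a new word and return its composed model."""
    lexicon[word] = list(units)
    return build_word_model(units)


if __name__ == "__main__":
    # Borrow the first two units of "unbolt" and prepend them to "birthday".
    prefix = LEXICON["unbolt"][:2]
    model = add_word(LEXICON, "unbirthday", prefix + LEXICON["birthday"])
    print(len(model), model[:4])
```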

But large vocabularies can reduce performance or increase complexity. Searching a large vocabulary for every word takes time and increases the likelihood that similar-sounding words will produce errors. Acoustics alone can't prevent a system from recognizing "to" when a user means "two," or from selecting "write" instead of "right."

Grammar Knows Best

Position and context can help in picking the right word. Language has an internal structure, which we refer to as grammar. For English, that structure limits word sequencing. Speech-recognition systems also use grammars to reduce or eliminate unacceptable word sequences.

The most common grammar for speech recognition is the finite-state grammar, which consists of a set of states connected by transitions, like HMMs without the probabilities. A finite-state grammar defines the paths a user can take through the application and specifies what words are acceptable at each state (known as its active vocabulary). Limiting the active vocabulary speeds processing and helps minimize errors. Consider the figure "Who's on First?". This finite-state grammar has 14 words or phrases and specifies the active vocabulary.
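A finite-state grammar of this kind can be sketched as a table that maps each state to its active vocabulary and, for each word, the state that follows. The voice-dialing grammar below is a made-up example, not the grammar in the figure.

```python
# Hypothetical voice-dialing grammar: each state maps every word in its
# active vocabulary to the next state. "end" is the only accepting state.
GRAMMAR = {
    "start": {"call": "who", "dial": "who"},
    "who": {"home": "end", "office": "end", "operator": "end"},
    "end": {},
}


def active_vocabulary(state):
    """Words the recognizer needs to consider at this state."""
    return sorted(GRAMMAR[state])


def accepts(words):
    """True if the word sequence follows a complete path through the grammar."""
    state = "start"
    for word in words:
        if word not in GRAMMAR[state]:
            return False
        state = GRAMMAR[state][word]
    return state == "end"


if __name__ == "__main__":
    print(active_vocabulary("who"))
    print(accepts(["call", "home"]), accepts(["home", "call"]))
```

At the "who" state the recognizer compares the input against only three models, however many words the whole application knows.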

Finite-state grammars are excellent for highly structured applications, such as inspections in manufacturing and voice control of a GUI, but they don't allow the freedom needed for unstructured dictation. For that application, statistical language models work better. These models contain probabilities about how likely it is that a particular word was uttered, given the identity of the preceding word (bigram model) or two words (trigram model).
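A bigram model can be sketched in a few lines. The four-sentence training text is invented, and add-k smoothing is an assumption chosen here to keep unseen word pairs from scoring zero; commercial systems train on far larger text collections and may smooth differently. The example shows how the preceding word helps choose between the sound-alikes "write" and "right."

```python
from collections import Counter, defaultdict

# Tiny invented training text, for illustration only.
CORPUS = [
    "please write a letter",
    "turn right at the light",
    "write a note",
    "the right answer",
]


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def bigram_prob(counts, prev, word, vocab_size, k=1.0):
    """Estimate P(word | prev) with add-k smoothing."""
    total = sum(counts[prev].values())
    return (counts[prev][word] + k) / (total + k * vocab_size)


def pick(counts, prev, candidates, vocab_size):
    """Choose the acoustically confusable candidate the model finds likeliest."""
    return max(candidates, key=lambda w: bigram_prob(counts, prev, w, vocab_size))


if __name__ == "__main__":
    counts = train_bigrams(CORPUS)
    vocab = {w for s in CORPUS for w in s.split()}
    for prev in ("please", "turn"):
        print(prev, "->", pick(counts, prev, ["write", "right"], len(vocab)))
```

A trigram model follows the same pattern but keys the counts on the two preceding words instead of one.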

Many telephone applications have to deal with unpredictable input. For example, a bank customer looking for a home loan might say something like: "I want to find out about mortgages" or "I wanna buy a home, and I need a loan." Such applications need what's called keyword spotting. This procedure doesn't try to identify every word but instead looks for patterns that match specified keywords (e.g., "mortgage" or "loan"). If the system hears one of those words, it takes a programmed action.
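The control flow of keyword spotting can be sketched as follows. A real spotter matches acoustic patterns in the audio itself; this simplified version assumes word hypotheses are already available as text, and the keyword table and action names are hypothetical.

```python
# Hypothetical keyword table: each spoken variant maps to a programmed action.
KEYWORDS = {
    "mortgage": "loans_desk",
    "mortgages": "loans_desk",
    "loan": "loans_desk",
    "balance": "account_info",
}


def spot_keyword(transcript):
    """Return (keyword, action) for the first keyword heard, or None.

    Everything that is not a keyword is simply ignored.
    """
    for token in transcript.lower().split():
        word = token.strip(".,?!\"'")
        if word in KEYWORDS:
            return word, KEYWORDS[word]
    return None


if __name__ == "__main__":
    print(spot_keyword("I want to find out about mortgages"))
    print(spot_keyword("I wanna buy a home, and I need a loan."))
```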

Talkus Interruptus

In normal speech, we run words together. This so-called continuous speech can be difficult for a speech-recognition system to handle. The most common alternative is discrete-word input, where users pause between words. Discrete-word input simplifies the identification of word boundaries. With a limited vocabulary and a finite-state grammar, continuous-speech recognition doesn't require too much computation. But for a large vocabulary with a statistical model, considerable computing power is required.
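To see why pauses make word boundaries easy to find, consider this sketch. It assumes the audio has already been reduced to one energy value per frame; the threshold and minimum pause length are illustrative guesses, not values from any product.

```python
def split_on_pauses(energies, threshold=0.1, min_pause=3):
    """Return half-open (start, end) frame ranges, one per detected word.

    A word ends once at least min_pause consecutive frames fall below
    the energy threshold. Shorter dips are treated as part of the word.
    """
    words = []
    start = None   # frame where the current word began
    silent = 0     # consecutive low-energy frames seen inside a word
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= min_pause:
                words.append((start, i - silent + 1))
                start, silent = None, 0
    if start is not None:
        # Input ended before a full pause; drop any trailing quiet frames.
        words.append((start, len(energies) - silent))
    return words


if __name__ == "__main__":
    frames = [0, 0, .5, .6, .4, 0, 0, 0, 0, .7, .8, 0, .6, 0, 0, 0]
    print(split_on_pauses(frames))
```

With continuous speech there are no such gaps, so the recognizer has to consider word boundaries at many points in the input, which is a large part of the extra computation.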

Existing laboratory systems for continuous-speech dictation can take from three to 10 times as long to process a speech sample as the person takes to say it. Philips Dictation Systems' SpeechMagic, the first commercial continuous-speech dictation system, avoids this dilemma by beginning its processing after the user has finished dictating. Greater commercial use of continuous-speech dictation awaits more powerful, less expensive CPUs.

Do I Know You?

The simplest approach to speaker modeling uses HMMs that are created from samples spoken by one person and intended for use only by that person. These are called speaker-dependent systems. Creating the model is called training, or enrollment. Each user must provide at least one spoken sample for each word in the vocabulary. Although this can take considerable up-front time, such systems can recognize users' speech very accurately.

But speaker-dependent systems aren't so good for one-time users or a large vocabulary. These situations call for a speaker-independent system built from samples by many individuals. Although they're less accurate than good speaker-dependent models, such models work surprisingly well. However, speech models created for American English might not work well with British speakers.

When applications demand large vocabularies and are to be used repeatedly by the same people, as for dictation, it's useful to tune the models to each speaker. Because users can't enroll thousands of words, large-vocabulary systems begin with primitive word forms called baseforms and modify them using smaller samples of a user's speech. This process is called speaker adaptation.
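One simple way to picture adaptation is as a weighted blend of a generic model parameter and the same parameter measured from the user's speech, with the user's share growing as more speech is heard. The sketch below applies that idea to a mean feature vector. It is a simple interpolation in the spirit of MAP adaptation, offered as an assumption for illustration and not as the method any product named here uses; `prior_weight` is a hypothetical tuning value.

```python
def adapt_mean(base_mean, user_frames, prior_weight=10.0):
    """Blend a generic feature mean with the average of a user's frames.

    With no user data the generic mean is returned unchanged; as the
    number of frames grows, the result moves toward the user's average.
    """
    n = len(user_frames)
    if n == 0:
        return list(base_mean)
    user_mean = [sum(column) / n for column in zip(*user_frames)]
    weight = n / (n + prior_weight)
    return [(1 - weight) * b + weight * u
            for b, u in zip(base_mean, user_mean)]


if __name__ == "__main__":
    generic = [0.0, 0.0]
    print(adapt_mean(generic, [[1.0, 2.0]] * 10))
    print(adapt_mean(generic, [[1.0, 2.0]] * 990))
```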

A common form is called "on-the-fly adaptation." Found in Dragon Systems' DragonDictate, it adjusts to the speaker during use. Another approach, known as "rapid enrollment" and used by IBM's VoiceType dictation system, requires a one-time enrollment process that takes anywhere from 45 minutes to 2 hours.

OK, What's It Good for?

Speech-recognition systems are suited to four primary functions: command and control, data entry, data access and querying, and dictation (see the table "Speech Recognition's Four Main Uses"). Most often, the nature of the application dictates what type of speech-recognition product and technology should be used and determines what features are important.

With command and control, you operate a computer or other device using spoken commands; voice-dialing and GUI-navigation systems are examples. The first applications of this type allowed military personnel and factory workers to operate equipment such as map displays in tanks and aircraft.

Voice command and control is now being used in consumer products, including personal digital assistants, VCR programmers, toys, and home appliances. It also gives hands-free control of wheelchairs and other equipment to disabled people.

The telephone is arguably the most popular current platform for speech command and control. Speech also provides a simple, easy-to-use, uniform interface for call management and message-processing operations, and it's an important part of most modern telephony applications. For example, call routing is easier when callers can just say "technical support" or "tech support" to reach the appropriate line.

Most command-and-control systems need small vocabularies in a simple structure. Many systems require the high accuracy offered by good speaker-dependent models and expect superior noise tolerance. In most cases, commands are short enough to allow either discrete-word or continuous-speech recognition.

Data Entry

A data-entry speech-recognition product is an "eyes busy, hands busy" input device that allows an individual to enter data while performing a demanding manual task. Early applications were in manufacturing jobs, such as inspection, receiving, and quality audit.

Newer applications are appearing in other fields. For example, several systems allow physicians and nurses to enter data while examining patients. Visa Interactive recently deployed a speech interface for bill payment over the phone. Using speech-recognition systems, the U.S. Bureau of Labor Statistics has been able to expand its data-collection capabilities despite a shrinking staff.

Data-entry applications are usually highly structured and can support either discrete or continuous input. Vocabularies can range from small to moderately large; speaker-modeling requirements depend on the size and nature of the user population.

Queries and Data Access

Voice data access is used primarily over the telephone for gathering information from databases and other on-line sources. Banks that wanted to extend their remote services to customers with rotary phones were early users of voice data access. With Touch-Tone technology being rare outside North America, speech recognition permits cost-effective 24-hour support for overseas customers.

The most notable application of voice-activated data access is in information-retrieval systems. For example, both West Publishing and Lexis-Nexis offer speech-recognition interfaces for searching their legal databases. Both companies' products convert spoken queries into SQL statements.

Keyword spotting allows continuous-speech input and speaker-independent modeling for small-vocabulary, telephone-based systems. Database-retrieval systems currently employ discrete-word input and speaker adaptation.

Dictation: Computer, Take a Letter

Dictation comes in two basic forms: structured report generation and free-form dictation. Reporting systems are widely used in health care and are gaining popularity among attorneys.

Dictation systems need big vocabularies -- 20,000 words or more. Free-form dictation requires statistical grammars, but structured report generation can be implemented with finite-state grammars. Current technology relies mainly on discrete-word recognition and speaker adaptation.

Do What I Mean, Not What I Say

Speech-recognition technology is a long way from human communication. While figuring out what words are spoken can help automate many operations, it's still only one part of a larger, more difficult puzzle -- figuring out what a spoken communication means.

A new field of study, known as spoken language understanding (SLU), aims at improving the verbal communication skills of machines. SLU research is driven primarily by the Defense Department's Advanced Research Projects Agency and by government funding from Japan and Europe. Several organizations are working on speech-to-speech translation, even over transoceanic telephone lines. Researchers and commercial companies are developing systems that can handle limited chunks of meaning that are important for natural conversation. We'll see significant advances in the SLU field in the next few years, but full implementation remains a distant goal.

Neural-net technology is also emerging. Sensory Circuits offers a chip-level product used in toys and other consumer products. Lernout and Hauspie (Woburn, MA) is licensing its neural-net technology. Neural nets will be instrumental in improving noise immunity and creating more flexible, speaker-independent models.

Finally, support for speech recognition is being provided by the development of API standards. Proposals covering telecommunications platforms, Windows 3.1, and Windows 95 standards have been formulated and are being adopted. By the end of the century, all these technical advances will make today's speech-recognition technology, as good as it is, look primitive.


ACKNOWLEDGMENTS

Some information for this article was provided by Martha Lindeman, Ph.D., president of Users First, Inc. (Columbus, OH), and Bruce Armstrong, manager, Novell Speech Technology (Orem, UT), and chairman of the Speech Recognition Application Programming Interface Standards Committee.


WHERE TO FIND


Hark Recognizer
IBM-compatible PCs; Unix Workstations.................$400 per port

Target Customer: Applications developer, Product developer
BBN Hark Systems Corp.
Cambridge, MA
(617) 873-4636
fax (617) 873-2473
hark-info@bbn.com

http://www.bbn.com



DragonDictate
IBM-compatible PCs (486/33 and up)....................$395 (5K words)
......................................................$695 (30K)
.....................................................$1695 (60K)

Target Customer: End user, OEM
Dragon Systems, Inc.
Newton, MA
(800) 825-5897
(617) 965-5200
fax (617) 527-0372


VoiceType
IBM-compatible PCs....................................Starts at $999

Target Customer: End user
IBM Corp.
Boca Raton, FL
(407) 443-8011
fax (407) 443-6549


Kurzweil Voice for Windows release 1.5
IBM-compatible PCs (486/33 and up)....................$995 (includes sound
                                                            board and
                                                            microphone)

Target Customer: End user
Kurzweil Applied Intelligence
Waltham, MA
(800) 380-1234
(617) 893-5151
fax (617) 893-7653


SpeechMagic
SpeechPro (language development tool)
IBM-compatible PCs (486 and up).......................Consult vendor

Target Customer: Applications developer, Product developer
Philips Dictation Systems
San Francisco, CA
(415) 434-7715
fax (415) 434-7729


RSC-164 Series
Chip-level............................................Under $5 per chip in
                                                       quantity

Target Customer: Product developer
Sensory Circuits
San Jose, CA
(408) 452-1000
fax (408) 452-1025

http://www.sensory.com/



PE500
IBM-compatible PCs (486 and up).......................$995

Target Customer: Applications developer, OEM
Speech Systems, Inc.
Boulder, CO
(303) 938-1110
fax (303) 938-1874


Wuzzatdoonear? Idano

If you think speech recognition is a simple problem, consider the
following as examples of normal, everyday speech, the kind of thing
we hear all the time and never wonder what it means.


hominyuwan?
 (How many do you want?)

amina
 (I'm gonna [borrowed from George Carlin])

jeet?
 (Did you eat?)

wuhjusay?
 (What did you say?)

ahluv
 (All of; I love; I'll have; olive [Take your pick!]) 


This raises the possibility of the following spoken sentence: "Ahluv,
ahluv an ahluv, cuz ahluv ahluv 'em lil greentings."




Speech-Recognition Products


                                       
                              Technology Used
                              ===============

Product          Primary      Continuous (C)  Speaker          Dictionary Size      Features
                 Functions    Discrete (D)    Dependent (D)    (max. words per app.)
                                              Independent (I)
                                              Adaptive (A)
====================================================================================================

Hark Recognizer  1, 2, 3      C               I                L (100K; 2K active)  FSG, HMM
                 (telephony)

DragonDictate    1, 4         D               A                L (60K)              HMM, S

VoiceType        1, 4         D               A                L (22K)              HMM, S

Kurzweil Voice   1, 4         C (for digits)  A                L (30K or 60K)       Undisclosed; thought
for Windows                   D                                                     to be FSG, HMM, S
rel. 1.5

SpeechMagic      4            C               A                L (50K and up)       HMM, S
SpeechPro

RSC-164 Series   1            D               D, I             S                    FSG, neural network
                                                                                    (also does speech and
                                                                                    music synthesis, voice
                                                                                    recording)

PE500            1, 2, 3      C               I                L (40K)              FSG, phoneme model


KEY

Technology Used:
Primary Functions: 
  1 = command and control
  2 = data entry
  3 = data access/querying
  4 = dictation/report generation
Dictionary Size:
  S = Small
  M = Medium
  L = Large
Features:
  FSG = Finite-state grammar
  HMM = Hidden Markov models
  S = Statistical language model
                                                          




Speech Recognition's Four Main Uses


COMMAND AND CONTROL

Voice control of machine operations.
Voice-activated dialing; navigation of GUIs.


DATA ENTRY

Input of data to quality-control systems, databases, or other software.
Inspection data; forms completion; order entry.


DICTATION

Creation of letters and other documents using free-form or structured dictation.
General dictation; structured report generation.


DATA ACCESS/INFORMATION RETRIEVAL

Search and retrieval of on-line data.
Banking by phone; directory assistance.




Speech-Recognition Features

Recognition algorithm   Method of representing speech and
                        comparing stored models with user input.

Vocabulary              The number and types of words included
                        in the application. Vocabulary size can
                        range from two words to more than 60,000.

Grammar                 Structure imposed on the application that
                        defines what can be said and in what
                        sequence. Possible types include
                        finite-state grammar, statistical language
                        models, keyword spotting, or no grammar.

Speech flow             How a user must speak to the system,
                        either with continuous speech or in discrete
                        words with pauses in between.

Speaker model           How the system gathers information about,
                        and represents, users' acoustic patterns.
                        A system can be speaker-dependent,
                        speaker-independent, or speaker-adaptive.




Who's on First


This sample finite-state grammar allows a limited range of sentences to be created. Would it have been simpler before last year's baseball strike?


Judith Markowitz (Chicago, IL) is author of Using Speech Recognition (Prentice-Hall, 1995). You can reach her on BIX c/o "editors" or on the Internet at markwitz@steve.iit.edu.
