The ultimate computer input device may be right under your nose
Judith Markowitz
You don't see the crew of the Starship Enterprise fussing much with keyboards. When someone wants to ask the computer a question, he or she normally just speaks to it. It's fast, efficient, and natural -- in Star Trek's twenty-fourth-century universe. In our own time, however, keyboards and mice are a lot more important for computing than our voices are.
But talking to machines is too good an idea to ignore. Speech is such a basic and universal mode of communication that it's natural to want to talk to machines, such as computers and telephones.
What Did You Say?
Whether we use it to dial a telephone, navigate through Windows, dictate a letter, or enter data, speech recognition's basic job remains the same: to identify what a person has said, and to do so quickly, accurately, and seamlessly. It has to identify features from a continuous blast of speech and noise that spans the entire spectrum of audible frequencies.
The task is complicated by regional accents and speech habits (see the table "Wuzzatdoonear? Idano"). We rarely notice them because we use nonverbal and situational cues to help us. Speech-recognition systems depend almost exclusively on acoustic data, yet we still expect them to perform as accurately as we do.
Several features influence the accuracy and speed of a speech-recognition system: the recognition algorithms used, the size and nature of the vocabulary, the grammar, whether speech is continuous or discrete, and the speaker model. These are summarized in the table "Speech-Recognition Features".
I Recognize that Algorithm
Speech-recognition systems compare stored vocabulary models with spoken input according to specified recognition algorithms. No match will be exact, because slight differences in speed, emphasis, emotion, and other details change a word's acoustic patterns and length, even with just one speaker.
Speech-recognition systems represent words in different ways. Some systems use templates, which encode acoustic patterns from one or more samples and then compare acoustic patterns with spoken input, frame by frame. Most products, however, use hidden Markov models, or HMMs (see the sidebar "Hidden Markov Models"). Two recognition algorithms commonly used with HMMs are the Baum-Welch maximum likelihood ("best match") algorithm and the Viterbi ("best path") algorithm. Both process the input through an HMM and produce a probability rating.
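To make the two scoring styles concrete, here is a minimal Python sketch of a toy word model. Everything in it is assumed for illustration: the three states, the quantized acoustic symbols "a", "b", and "c", and all of the probabilities. The "best match" score is sketched as a forward pass that sums over every path through the model, which is one common reading of maximum-likelihood scoring; the "best path" score is the Viterbi search for the single most likely path.

```python
# Toy left-to-right HMM for one word. All states, symbols, and numbers
# are invented for illustration; they do not come from any real product.
STATES = ["s1", "s2", "s3"]
START = {"s1": 1.0, "s2": 0.0, "s3": 0.0}
TRANS = {
    "s1": {"s1": 0.6, "s2": 0.4, "s3": 0.0},
    "s2": {"s1": 0.0, "s2": 0.7, "s3": 0.3},
    "s3": {"s1": 0.0, "s2": 0.0, "s3": 1.0},
}
# Probability of observing each quantized acoustic symbol in each state.
EMIT = {
    "s1": {"a": 0.8, "b": 0.1, "c": 0.1},
    "s2": {"a": 0.1, "b": 0.8, "c": 0.1},
    "s3": {"a": 0.1, "b": 0.1, "c": 0.8},
}


def forward_probability(obs):
    """'Best match': probability of the input summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            s: sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s][symbol]
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(obs):
    """'Best path': (probability, path) of the single most likely state path."""
    best = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for symbol in obs[1:]:
        step = {}
        for s in STATES:
            # Pick the predecessor that gives the highest score on entering s.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][symbol], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


if __name__ == "__main__":
    frames = list("aabbc")
    print(forward_probability(frames))
    print(viterbi(frames))
```

A recognizer built along these lines would run the same input through the model for each candidate word and keep the word whose model returns the highest rating.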
HMMs are fast, efficient, and accurate, but the industry and the technology are evolving rapidly, and developers are investigating alternative approaches. One option, auditory modeling, attempts to reproduce operations of the inner ear and auditory nerve. Test systems have improved accuracy, speaker modeling, and noise rejection. Unfortunately, human auditory behavior is poorly understood, and full auditory models remain a long-range goal.
Artificial neural networks have begun to appear in commercial speech recognition. Neural nets can extract complex patterns from large quantities of messy data, which makes them well-suited for speech.
How Many Words Is Enough?
Even dictionaries with 100,000 words can't meet all needs, so some products allow users and developers to add words. The newest tools create new vocabulary items by combining HMMs. A system could construct the word unbirthday, for example, by assembling HMMs for each of its sounds. Or it could extract un from unbolt and attach it to birthday.
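The idea of building a word out of smaller models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-state-per-sound layout, the ARPAbet-like pronunciations, and the function names are all hypothetical, and the "models" here are just lists of state names with no probabilities attached.

```python
# Hypothetical inventory: each sound is modeled by a short chain of states.
# Only state names are kept; a real model would also carry probabilities.
STATES_PER_SOUND = 3

# Illustrative pronunciations in an ARPAbet-like notation (assumed).
LEXICON = {
    "unbolt": ["AH", "N", "B", "OW", "L", "T"],
    "birthday": ["B", "ER", "TH", "D", "EY"],
}


def sound_model(sound):
    """Return the state names of a hypothetical model for one sound."""
    return [f"{sound}_{i}" for i in range(1, STATES_PER_SOUND + 1)]


def word_model(sounds):
    """Concatenate the per-sound models into one word-level state chain."""
    states = []
    for sound in sounds:
        states.extend(sound_model(sound))
    return states


def add_word_from_sounds(word, sounds):
    """First approach: assemble the word from a model for each sound."""
    LEXICON[word] = list(sounds)
    return word_model(LEXICON[word])


def add_word_from_parts(word, prefix_source, prefix_len, base):
    """Second approach: borrow a prefix from one word, attach it to another."""
    sounds = LEXICON[prefix_source][:prefix_len] + LEXICON[base]
    LEXICON[word] = sounds
    return word_model(sounds)


if __name__ == "__main__":
    # Take the first two sounds ("un") from "unbolt" and attach "birthday".
    print(add_word_from_parts("unbirthday", "unbolt", 2, "birthday"))
```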
But large vocabularies can reduce performance or increase complexity. Searching a large vocabulary for every word takes time and increases the likelihood that similar-sounding words will produce errors. Acoustics alone can't prevent a system from recognizing to when a user means two, or from selecting write instead of right.
Grammar Knows Best
Position and context can help in picking the right word. Language has an internal structure, which we refer to as grammar. For English, that structure limits word sequencing. Speech-recognition systems also use grammars to reduce or eliminate unacceptable word sequences.
The most common grammar for speech recognition is the finite-state grammar, which consists of a set of states connected by transitions, like HMMs without the probabilities. A finite-state grammar defines the paths a user can take through the application and specifies what words are acceptable at each state (known as its active vocabulary). Limiting the active vocabulary speeds processing and helps minimize errors. Consider the figure "Who's on First?". This finite-state grammar has 14 words or phrases and specifies the active vocabulary.
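A finite-state grammar of this kind maps naturally onto a table of states and transitions. The Python sketch below uses a small hypothetical grammar, not the one in the figure; the state names and the words are invented. It shows how the active vocabulary falls out of the current state and how an out-of-sequence word is rejected.

```python
# Hypothetical grammar: state -> {word or phrase: next state}.
# This is NOT the grammar from the figure; it only illustrates the idea.
GRAMMAR = {
    "START": {"who": "SUBJECT", "what": "SUBJECT"},
    "SUBJECT": {"is on": "BASE", "plays": "BASE"},
    "BASE": {"first": "END", "second": "END", "third": "END"},
    "END": {},
}


def active_vocabulary(state):
    """Words the recognizer needs to consider while in this state."""
    return sorted(GRAMMAR[state])


def accepts(words):
    """Follow the transitions; reject any word that is not active."""
    state = "START"
    for word in words:
        if word not in GRAMMAR[state]:
            return False
        state = GRAMMAR[state][word]
    return state == "END"


if __name__ == "__main__":
    print(active_vocabulary("BASE"))
    print(accepts(["who", "is on", "first"]))
    print(accepts(["first", "who"]))
```

Because only the words listed for the current state are compared against the input, the recognizer never has to search the whole vocabulary at once.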
Finite-state grammars are excellent for highly structured applications, such as inspections in manufacturing and voice control of a GUI, but they don't allow the freedom needed for unstructured dictation. For that application, statistical language models work better. These models contain probabilities about how likely it is that a particular word was uttered, given the identity of the preceding word (bigram model) or two words (trigram model).
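A bigram model can be sketched as a table of word-pair counts. In the Python example below, the tiny training corpus, the add-alpha smoothing, and the function names are assumptions made for illustration; a real dictation product would train on far more text. The point is only to show how the preceding word lets a system choose between acoustically identical candidates such as to and two.

```python
from collections import Counter, defaultdict

# Tiny invented training corpus; real models use very large text samples.
CORPUS = [
    "i want to write a letter",
    "i want two letters",
    "turn right at the corner",
    "you are right",
    "i need to write two letters",
]
VOCAB = {word for sentence in CORPUS for word in sentence.split()}


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def bigram_probability(counts, prev, word, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | prev); smoothing is assumed."""
    total = sum(counts[prev].values())
    return (counts[prev][word] + alpha) / (total + alpha * vocab_size)


def pick(counts, prev, candidates, vocab_size):
    """Choose the sound-alike candidate most likely to follow prev."""
    return max(
        candidates,
        key=lambda w: bigram_probability(counts, prev, w, vocab_size),
    )


if __name__ == "__main__":
    counts = train_bigrams(CORPUS)
    print(pick(counts, "to", ["right", "write"], len(VOCAB)))
    print(pick(counts, "write", ["to", "two"], len(VOCAB)))
```

A trigram model would work the same way, with the table keyed on the two preceding words instead of one.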
Many telephone applications have to deal with unpredictable input. For example, a bank customer looking for a home loan might say something like: "I want to find out about mortgages" or "I wanna buy a home, and I need a loan." Such applications need what's called keyword spotting. This procedure doesn't try to identify every word but instead looks for patterns that match specified keywords (e.g., mortgage or loan). If the system hears one of those words, it takes a programmed action.
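The control flow of keyword spotting can be sketched in Python as follows. This is a simplification under a loud assumption: it scans text that has already been transcribed, whereas a real keyword spotter matches acoustic patterns in the audio itself. The keyword table and the action names are hypothetical.

```python
import re

# Hypothetical keyword -> action table for a bank's telephone line.
KEYWORD_ACTIONS = {
    "mortgage": "route_to_home_loans",
    "mortgages": "route_to_home_loans",
    "loan": "route_to_home_loans",
    "balance": "read_account_balance",
}


def spot_keywords(utterance):
    """Return the keywords found, in spoken order; ignore everything else."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return [word for word in words if word in KEYWORD_ACTIONS]


def choose_action(utterance, default="transfer_to_operator"):
    """Take the programmed action for the first keyword heard, if any."""
    hits = spot_keywords(utterance)
    return KEYWORD_ACTIONS[hits[0]] if hits else default


if __name__ == "__main__":
    print(choose_action("I want to find out about mortgages"))
    print(choose_action("I wanna buy a home, and I need a loan."))
    print(choose_action("Hello there"))
```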
Talkus Interruptus
In normal speech, we run words together. This so-called continuous speech can be difficult for a speech-recognition system to handle. The most common alternative is discrete-word input, where users pause between words. Discrete-word input simplifies the identification of word boundaries. With a limited vocabulary and a finite-state grammar, continuous-speech recognition doesn't require too much computation. But for a large vocabulary with a statistical model, a great deal of processing power is required.
Existing laboratory systems for continuous-speech dictation can take from three to 10 times as long to process a speech sample as the person takes to say it. Philips Dictation Systems' SpeechMagic, the first commercial continuous-speech dictation system, avoids this dilemma by beginning its processing after the user has finished dictating. Greater commercial use of continuous-speech dictation awaits more powerful, less expensive CPUs.
Do I Know You?
The simplest approach uses HMMs that are created from samples spoken by one person and are intended for use only by that person. Systems built this way are called speaker-dependent systems. Creating the model is called training, or enrollment. Each user must provide at least one spoken sample for each word in the vocabulary. Although this can take considerable up-front time, such systems can recognize users' speech very accurately.
But speaker-dependent systems aren't so good for one-time users or a large vocabulary. These situations call for a speaker-independent system built from samples by many individuals. Although they're less accurate than good speaker-dependent models, such models work surprisingly well. However, speech models created for American English might not work well with British speakers.
When applications demand large vocabularies and are to be used repeatedly by the same people, as for dictation, it's useful to tune the models to each speaker. Because users can't enroll thousands of words, large-vocabulary systems begin with primitive word forms called baseforms and modify them using smaller samples of a user's speech. This process is called speaker adaptation.
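One simple way to picture speaker adaptation is as a weighted blend of a generic model and a user's own samples. The Python sketch below is an assumption-laden illustration, not a description of how any commercial product adapts: the feature vectors, the `weight` parameter, and the function name are all hypothetical. It nudges the stored average for one sound toward the user's speech, and moves further the more samples the user supplies.

```python
def adapt_mean(base_mean, user_frames, weight=10.0):
    """Blend a generic model's mean vector with a user's sample frames.

    `weight` (an assumed tuning knob) acts like a count of virtual frames
    backing the generic model: with few user frames the result stays close
    to the baseform, with many it approaches the user's own average.
    """
    count = len(user_frames)
    if count == 0:
        return list(base_mean)
    dims = len(base_mean)
    user_sum = [sum(frame[d] for frame in user_frames) for d in range(dims)]
    return [
        (weight * base_mean[d] + user_sum[d]) / (weight + count)
        for d in range(dims)
    ]


if __name__ == "__main__":
    generic = [0.0, 0.0]             # invented two-dimensional feature mean
    samples = [[1.0, 2.0]] * 10      # ten invented frames from one user
    print(adapt_mean(generic, samples))
```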
A common form is called "on-the-fly adaptation." Found in Dragon Systems' DragonDictate, it adjusts to the speaker during use. Another approach, known as "rapid enrollment" and used by IBM's VoiceType dictation system, requires a one-time enrollment process that takes anywhere from 45 minutes to 2 hours.
OK, What's It Good for?
Speech-recognition systems are suited to four primary functions: command and control, data entry, data access and querying, and dictation (see the table "Speech Recognition's Four Main Uses"). Most often, the nature of the application dictates what type of speech-recognition product and technology should be used and determines what features are important.
With command and control, you operate a computer or other device using spoken commands, as in voice-dialing and GUI-navigation systems. The first applications of this type allowed military personnel and factory workers to operate equipment such as map displays in tanks and aircraft.
Voice command and control is now being used in consumer products, including personal digital assistants, VCR programmers, toys, and home appliances. It also gives hands-free control of wheelchairs and other equipment to disabled people.
The telephone is arguably the most popular current platform for speech command and control. Speech also provides a simple, easy-to-use, uniform interface for call management and message-processing operations, and it's an important part of most modern telephony applications. For example, call routing is easier when callers can just say "technical support" or "tech support" to reach the appropriate line.
Most command-and-control systems need small vocabularies in a simple structure. Many systems require the high accuracy offered by good speaker-dependent models and expect superior noise tolerance. In most cases, commands are short enough to allow either discrete-word or continuous-speech recognition.
Data Entry
A data-entry speech-recognition product is an "eyes busy, hands busy" input device that allows an individual to enter data while performing a demanding manual task. Early applications were in manufacturing jobs, such as inspection, receiving, and quality audit.
Newer applications are appearing in other fields. For example, several systems allow physicians and nurses to enter data while examining patients. Visa Interactive recently deployed a speech interface for bill payment over the phone. Using speech-recognition systems, the U.S. Bureau of Labor Statistics has been able to expand its data-collection capabilities despite a shrinking staff.
Data-entry applications are usually highly structured and can support either discrete or continuous input. Vocabularies can range from small to moderately large; speaker-modeling requirements depend on the size and nature of the user population.
Queries and Data Access
Voice data access is used primarily over the telephone for gathering information from databases and other on-line sources. Banks that wanted to extend their remote services to customers with rotary phones were early users of voice data access. Because Touch-Tone technology is rare outside North America, speech recognition permits cost-effective 24-hour support for overseas customers.
The most notable application of voice-activated data access is in information-retrieval systems. For example, both West Publishing and Lexis-Nexis offer speech-recognition interfaces for searching their legal databases. Both companies' products convert spoken queries into SQL statements.
Keyword spotting allows continuous-speech input and speaker-independent modeling for small-vocabulary, telephone-based systems. Database-retrieval systems currently employ discrete-word input and speaker adaptation.
Dictation: Computer, Take a Letter
Dictation comes in two basic forms: structured report generation and free-form dictation. Reporting systems are widely used in health care and are gaining popularity among attorneys.
Dictation systems need big vocabularies -- 20,000 words or more. Free-form dictation requires statistical grammars, but structured report generation can be implemented with finite-state grammars. Current technology relies mainly on discrete-word recognition and speaker adaptation.
Do What I Mean, Not What I Say
Speech-recognition technology is a long way from human communication. While figuring out what words are spoken can help automate many operations, it's still only one part of a larger, more difficult puzzle -- figuring out what a spoken communication means.
A new field of study, known as spoken language understanding (SLU), aims at improving the verbal communication skills of machines. SLU research is driven primarily by the Defense Department's Advanced Research Projects Agency and by government funding from Japan and Europe. Several organizations are working on speech-to-speech translation, even over transoceanic telephone lines. Researchers and commercial companies are developing systems that can handle limited chunks of meaning that are important for natural conversation. We'll see significant advances in the SLU field in the next few years, but full implementation remains a distant goal.
Neural-net technology is also emerging. Sensory Circuits offers a chip-level product used in toys and other consumer products. Lernout and Hauspie (Woburn, MA) is licensing its neural-net technology. Neural nets will be instrumental in improving noise immunity and creating more flexible, speaker-independent models.
Finally, support for speech recognition is being provided by the development of API standards. Proposals covering telecommunications platforms, Windows 3.1, and Windows 95 standards have been formulated and are being adopted. By the end of the century, all these technical advances will make today's speech-recognition technology, as good as it is, look primitive.
ACKNOWLEDGMENTS
Some information for this article was provided by Martha Lindeman, Ph.D., president of Users First, Inc. (Columbus, OH), and Bruce Armstrong, manager, Novell Speech Technology (Orem, UT), and chairman of the Speech Recognition Application Programming Interface Standards Committee.
WHERE TO FIND
Hark Recognizer
IBM-compatible PCs; Unix Workstations.................$400 per port
Target Customer: Applications developer, Product developer
BBN Hark Systems Corp.
Cambridge, MA
(617) 873-4636
fax (617) 873-2473
hark-info@bbn.com
http://www.bbn.com
DragonDictate
IBM-compatible PCs (486/33 and up)....................$395 (5K words)
......................................................$695 (30K)
.....................................................$1695 (60K)
Target Customer: End user, OEM
Dragon Systems, Inc.
Newton, MA
(800) 825-5897
(617) 965-5200
fax (617) 527-0372
VoiceType
IBM-compatible PCs....................................Starts at $999
Target Customer: End user
IBM Corp.
Boca Raton, FL
(407) 443-8011
fax (407) 443-6549
Kurzweil Voice for Windows release 1.5
IBM-compatible PCs (486/33 and up)....................$995 (includes sound board and microphone)
Target Customer: End user
Kurzweil Applied Intelligence
Waltham, MA
(800) 380-1234
(617) 893-5151
fax (617) 893-7653
SpeechMagic
SpeechPro (language development tool)
IBM-compatible PCs (486 and up).......................Consult vendor
Target Customer: Applications developer, Product developer
Philips Dictation Systems
San Francisco, CA
(415) 434-7715
fax (415) 434-7729
RSC-164 Series
Chip-level............................................Under $5 per chip in quantity
Target Customer: Product developer
Sensory Circuits
San Jose, CA
(408) 452-1000
fax (408) 452-1025
http://www.sensory.com/
PE500
IBM-compatible PCs (486 and up).......................$995
Target Customer: Applications developer, OEM
Speech Systems, Inc.
Boulder, CO
(303) 938-1110
fax (303) 938-1874
If you think speech recognition is a simple problem, consider the following as examples of normal, everyday speech, the kind of thing we hear all the time and never wonder what it means.
hominyuwan?
(How many do you want?)
amina
(I'm gonna [borrowed from George Carlin])
jeet?
(Did you eat?)
wuhjusay?
(What did you say?)
ahluv
(All of; I love; I'll have; olive [Take your pick!])
This raises the possibility of the following spoken sentence: "Ahluv,
ahluv an ahluv, cuz ahluv ahluv 'em lil greentings."
Technology Used
===============
Product               Primary       Continuous (C)   Speaker           Dictionary     Features
                      Functions     Discrete (D)     Dependent (D)     Size (max.
                                                     Independent (I)   words per
                                                     Adaptive (A)      app.)
=================================================================================================================
Hark Recognizer       1, 2, 3       C                I                 L (100K;       FSG, HMM
                      (telephony)                                      2K active)
DragonDictate         1, 4          D                A                 L (60K)        HMM, S
VoiceType             1, 4          D                A                 L (22K)        HMM, S
Kurzweil Voice for    1, 4          C (for digits),  A                 L (30K or      Undisclosed; thought to
Windows rel. 1.5                    D                                  60K)           be FSG, HMM, S
SpeechMagic           4             C                A                 L (50K and     HMM, S
SpeechPro                                                              up)
RSC-164 Series        1             D                D, I              S              FSG, neural network;
                                                                                      (also does speech and
                                                                                      music synthesis, voice
                                                                                      recording)
PE500                 1, 2, 3       C                I                 L (40K)        FSG, phoneme model
KEY
Technology Used:
Primary Functions:
1 = command and control
2 = data entry
3 = data access/querying
4 = dictation/report generation
Dictionary Size:
S = Small
M = Medium
L = Large
Features:
FSG = Finite-state grammar
HMM = Hidden Markov models
S = Statistical language model
COMMAND AND CONTROL
Voice control of machine operations.
Voice-activated dialing; navigation of GUIs.
DATA ENTRY
Input of data to quality-control systems, databases, or other software.
Inspection data; forms completion; order entry.
DICTATION
Creation of letters and other documents using free-form or structured dictation.
General dictation; structured report generation.
DATA ACCESS/INFORMATION RETRIEVAL
Search and retrieval of on-line data.
Banking by phone; directory assistance.
Recognition algorithm   Method of representing speech and comparing stored
                        models with user input.
Vocabulary              The number and types of words included in the
                        application. Vocabulary size can range from two
                        words to more than 60,000.
Grammar                 Structure imposed on the application that defines
                        what can be said and in what sequence. Possible
                        types include finite-state grammar, statistical
                        language models, keyword spotting, or no grammar.
Speech flow             How a user must speak to the system, either with
                        continuous speech or in discrete words with pauses
                        in between.
Speaker model           How the system gathers information about, and
                        represents, users' acoustic patterns. A system can
                        be speaker-dependent, speaker-independent, or
                        speaker-adaptive.
This sample finite-state grammar allows a limited range of sentences to be created. Would it have been simpler before last year's baseball strike?
Judith Markowitz (Chicago, IL) is author of Using Speech Recognition (Prentice-Hall, 1995). You can reach her on BIX c/o "editors" or on the Internet at markwitz@steve.iit.edu.