
Learning to Listen


May 1994 / Reviews / Learning to Listen

To translate speech into text, the IBM Personal Dictation System, or IPDS, employs four distinct but interwoven procedures. The first is acoustic processing, which extracts usable information from raw audio data. This step also uses an adaptation mechanism to filter out steady-state background audio (e.g., the hum of a computer fan) and to adjust to different microphones. The system collects your raw speech and breaks it down into centisecond (1/100-second) frames. Spectrum analysis then determines the distinct frequency characteristics (i.e., feature vectors) of each frame.
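The framing and spectrum-analysis steps can be sketched roughly as follows. This is only an illustration, not IBM's implementation: the 8,000-Hz sampling rate, the function names, and the use of a plain discrete Fourier transform as the "spectrum analysis" are all assumptions, and the background-noise adaptation is left out.

```python
import cmath
import math


def split_into_frames(samples, sample_rate):
    """Cut a list of audio samples into consecutive 1/100-second frames."""
    frame_len = sample_rate // 100  # samples per centisecond
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform: magnitudes for bins 0..N/2.

    The resulting list stands in for the frame's feature vector.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(total))
    return spectrum


if __name__ == "__main__":
    rate = 8000  # assumed sampling rate in Hz (not stated in the article)
    # One tenth of a second of a 1,000-Hz test tone
    tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(rate // 10)]
    frames = split_into_frames(tone, rate)
    features = magnitude_spectrum(frames[0])
    peak_bin = max(range(len(features)), key=features.__getitem__)
    print(len(frames), "frames; strongest frequency near",
          peak_bin * rate / len(frames[0]), "Hz")
```

A production system would use a fast Fourier transform and a more compact feature representation; the naive transform is used here only to keep the sketch self-contained.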

A statistical model, called the Hidden Markov Model, predicts which feature vectors are likely to represent a subphonetic sound (such as the "t" sound). These subphonemes are called labels. So, for example, the Hidden Markov Model for a "t" sound will most likely predict t-type labels. The system knows what sounds you are making during training because you are following a known script. It learns how you make a "t" sound, how you make an "a" sound, how you make an "a" sound when it follows a "t", and so on.
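Because the training script is known, learning which labels go with which sound can be pictured as a counting exercise. The sketch below assumes each frame has already been paired with the sound being spoken, which is a simplification (a real system would have to infer that alignment), and the label names such as T1 and A1 are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_label_probabilities(aligned_frames):
    """Estimate P(label | sound) from (sound, label) pairs.

    aligned_frames: iterable of (sound, label) tuples gathered while the
    speaker reads a known script. Returns {sound: {label: probability}}
    based on relative frequencies.
    """
    counts = defaultdict(Counter)
    for sound, label in aligned_frames:
        counts[sound][label] += 1
    return {
        sound: {label: n / sum(c.values()) for label, n in c.items()}
        for sound, c in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical training data: the sound from the script, then the
    # label produced by acoustic processing for that frame.
    training = [
        ("t", "T1"), ("t", "T1"), ("t", "T2"), ("t", "T1"),
        ("a", "A1"), ("a", "A1"), ("a", "T2"),
    ]
    for sound, dist in estimate_label_probabilities(training).items():
        print(sound, dist)
```

Modeling how an "a" changes after a "t" would amount to keying the counts on the pair of sounds rather than on a single sound.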

The next step, acoustic matching, compares the extracted labels to the acoustic models in the dictionary. Every word in the dictionary is broken down into these subphonetic labels, so the labels generated through acoustic processing can be matched to the dictionary entries.
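One common way to score a label sequence against a word model is the forward algorithm for Hidden Markov Models. The sketch below uses it as a stand-in for the acoustic match; the article does not say which procedure IPDS uses. The three-word dictionary, the label names, the fixed 0.5 "stay" probability, and the floor value for unseen labels are all assumptions.

```python
def forward_likelihood(labels, states, emit, stay=0.5):
    """Probability that a left-to-right HMM over `states` emits `labels`.

    Each state either repeats (probability `stay`) or advances to the next
    state (probability 1 - stay). A path must start in the first state and
    end in the last one. `labels` must be non-empty.
    """
    floor = 1e-6  # small probability for labels never seen with a sound
    alpha = [0.0] * len(states)
    alpha[0] = emit[states[0]].get(labels[0], floor)
    for label in labels[1:]:
        new_alpha = []
        for i, state in enumerate(states):
            arrive = alpha[i] * stay
            if i > 0:
                arrive += alpha[i - 1] * (1.0 - stay)
            new_alpha.append(arrive * emit[state].get(label, floor))
        alpha = new_alpha
    return alpha[-1]


def rank_words(labels, dictionary, emit):
    """Return dictionary words ordered from best to worst acoustic match."""
    scores = {
        word: forward_likelihood(labels, sounds, emit)
        for word, sounds in dictionary.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical P(label | sound) tables and dictionary entries
    emit = {
        "t": {"T1": 0.8, "T2": 0.2},
        "a": {"A1": 0.9, "T2": 0.1},
        "p": {"P1": 0.9, "T2": 0.1},
    }
    dictionary = {"tap": ["t", "a", "p"], "pat": ["p", "a", "t"], "at": ["a", "t"]}
    print(rank_words(["T1", "T1", "A1", "P1"], dictionary, emit))
```

The self-loop on each state lets a sound span several centisecond frames, which is why two T1 labels in a row can still match the single "t" in "tap".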

The system does not decide on the best word based on acoustic matching alone. It also employs an adaptive language model to enhance recognition accuracy. The language model is based on unigrams (single words), bigrams (sequences of two words), and trigrams (sequences of three words). The model maintains data on word usage and knows the probability that any single word or sequence of words will be used.

For instance, there is a relatively high probability that the word "the" will be spoken, and a lower probability that the word "creed" will be spoken. The system then looks at a pair of words and determines the probability that a particular pair of words will appear together. Next, it considers a set of three words and checks its probability data again. The system constantly refines its recognition of a particular word by looking ahead and back. As you dictate, you can watch the system dynamically alter its word guesses as the frame of reference around that word expands.
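A language model of this kind can be approximated with simple counts. The sketch below blends unigram, bigram, and trigram estimates by linear interpolation; the blending method, the weights, and the tiny training text are assumptions chosen for illustration, since the article does not say how IPDS combines the three.

```python
from collections import Counter


class TrigramModel:
    """Count-based language model mixing unigram, bigram, and trigram data."""

    def __init__(self, weights=(0.2, 0.3, 0.5)):
        self.weights = weights  # assumed mixing weights (uni, bi, tri)
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.trigrams = Counter()
        self.total = 0

    def train(self, words):
        """Add the counts from a list of words."""
        self.total += len(words)
        self.unigrams.update(words)
        self.bigrams.update(zip(words, words[1:]))
        self.trigrams.update(zip(words, words[1:], words[2:]))

    def probability(self, word, prev2=None, prev1=None):
        """Estimate P(word | prev2, prev1); missing context contributes zero."""
        w1, w2, w3 = self.weights
        p1 = self.unigrams[word] / self.total if self.total else 0.0
        history1 = self.unigrams[prev1]
        p2 = self.bigrams[(prev1, word)] / history1 if history1 else 0.0
        history2 = self.bigrams[(prev2, prev1)]
        p3 = self.trigrams[(prev2, prev1, word)] / history2 if history2 else 0.0
        return w1 * p1 + w2 * p2 + w3 * p3


if __name__ == "__main__":
    model = TrigramModel()
    model.train("the cat sat on the mat the cat ate the fish".split())
    print("P(the)       =", model.probability("the"))
    print("P(creed)     =", model.probability("creed"))
    print("P(cat | the) =", model.probability("cat", prev1="the"))
```

With this toy text, "the" scores higher than "creed" on its own, and "cat" becomes far more likely once the preceding word is known to be "the", mirroring how the estimate sharpens as more context becomes available.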

The last step of the process, the hypothesis search, combines the results of both the acoustic matching and the language model to determine the most probable word string.
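A hypothesis search of this sort can be sketched as a beam search that adds acoustic and language-model log probabilities for each candidate word string. Beam search is an assumed technique here, not necessarily the one IPDS uses, and the candidate lists, the bigram table, the default probability, and the `lm_prob` callback are all hypothetical.

```python
import math


def beam_search(candidates, lm_prob, beam_width=3, lm_weight=1.0):
    """Find a high-scoring word string.

    candidates: one dict per spoken word, mapping word -> acoustic probability.
    lm_prob(prev_word, word): language-model probability (hypothetical callback).
    Returns the best word sequence found, as a list.
    """
    beam = [(0.0, [])]  # (log score, words so far)
    for slot in candidates:
        extended = []
        for score, words in beam:
            prev = words[-1] if words else None
            for word, acoustic in slot.items():
                step = math.log(acoustic) + lm_weight * math.log(lm_prob(prev, word))
                extended.append((score + step, words + [word]))
        # Keep only the most probable partial hypotheses
        extended.sort(key=lambda item: item[0], reverse=True)
        beam = extended[:beam_width]
    return beam[0][1]


if __name__ == "__main__":
    # Hypothetical acoustic scores: "right" sounds slightly more likely
    # than "write" when heard in isolation.
    candidates = [
        {"right": 0.6, "write": 0.4},
        {"a": 0.9, "uh": 0.1},
        {"letter": 0.6, "ladder": 0.4},
    ]
    bigram = {
        (None, "right"): 0.3, (None, "write"): 0.3,
        ("write", "a"): 0.4, ("right", "a"): 0.1,
        ("a", "letter"): 0.2, ("a", "ladder"): 0.1,
    }

    def lm_prob(prev, word):
        return bigram.get((prev, word), 0.01)  # assumed default for unseen pairs

    print(beam_search(candidates, lm_prob))
```

In this example the first word is initially ranked as "right" on acoustic evidence alone, but once "a" follows, the combined score favors "write", which is the kind of revision you see on screen as the context around a word expands.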

In addition to adding new words to the dictionary as you specify them, the system updates the probability models to reflect your unique word-usage patterns. This adaptive process allows the system to become more accurate as you use it. It also explains why the system works better with documents that share consistent terminology and phraseology: It can better predict what words you are likely to say if you follow consistent patterns of word usage.
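The adaptive behavior can be illustrated with a word-count table that is updated each time dictated text is confirmed. This is a minimal sketch under stated assumptions: the class name, the starting counts, and the add-one smoothing (which reserves a little probability for words not yet in the dictionary) are illustrative choices, not details from IPDS.

```python
from collections import Counter


class AdaptiveUnigram:
    """Word-usage model that shifts toward the speaker's own vocabulary."""

    def __init__(self, base_counts):
        self.counts = Counter(base_counts)

    def probability(self, word):
        # Add-one smoothing over the known vocabulary plus one slot
        # for words not yet seen (an assumption of this sketch).
        vocab = len(self.counts) + 1
        return (self.counts[word] + 1) / (sum(self.counts.values()) + vocab)

    def observe(self, confirmed_words):
        """Update usage counts; unseen words join the vocabulary."""
        self.counts.update(confirmed_words)


if __name__ == "__main__":
    model = AdaptiveUnigram({"the": 50, "creed": 1, "patient": 2})
    before = model.probability("angioplasty")
    model.observe(["the", "patient", "angioplasty", "angioplasty", "angioplasty"])
    after = model.probability("angioplasty")
    print("before:", before, "after:", after)
```

The more consistently a term recurs in your documents, the larger its share of the counts becomes, so the model's guesses drift toward your habitual terminology.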


Copyright © 2005 CMP Media LLC