To translate speech into text, the IBM Personal Dictation System, or IPDS, employs four distinct but interwoven procedures. The first is acoustic processing, which extracts usable information from raw audio data. The process also uses an adaptation mechanism to filter out steady-state background audio (e.g., the hum of a computer fan) and to adjust to different microphones. The system collects your raw speech and breaks it down into centisecond (1/100-second) frames. Spectrum analysis then determines the distinct frequency characteristics (i.e., the feature vector) of each centisecond frame.
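The framing and spectrum-analysis steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IPDS's actual signal processing: the function names, the 8000-Hz sample rate, the test tone, and the use of a plain discrete Fourier transform as the "feature vector" are all hypothetical choices made for the example.

```python
import cmath
import math

def split_into_frames(samples, sample_rate):
    """Cut a list of audio samples into consecutive 1/100-second frames."""
    frame_len = sample_rate // 100  # samples per centisecond
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def magnitude_spectrum(frame):
    """Naive discrete Fourier transform: one magnitude per frequency bin."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) / n
            for k in range(n // 2 + 1)]

# Hypothetical input: 0.03 second of a 1000-Hz tone sampled at 8000 Hz.
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(240)]

frames = split_into_frames(signal, SAMPLE_RATE)
feature_vectors = [magnitude_spectrum(frame) for frame in frames]

# The strongest frequency bin of the first frame, converted back to hertz.
first = feature_vectors[0]
peak_bin = max(range(len(first)), key=lambda k: first[k])
peak_hz = peak_bin * SAMPLE_RATE / len(frames[0])
```

Each entry in `feature_vectors` describes one centisecond of audio. For the pure tone used here, the strongest bin of every frame sits at the tone's own frequency.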
A statistical model, called a hidden Markov model, predicts which feature vectors are likely to represent a subphonetic sound (such as the "t" sound). These subphonemes are called labels. So, for example, the hidden Markov model for a "t" sound will most likely predict t-type labels. The system knows what sounds you are making during training because you are following a known script. It learns how you make a "t" sound, how you make an "a" sound, how you make an "a" sound when it follows a "t," and so on.
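To see why the model for one sound "most likely predicts" its own labels, consider a toy example. The sketch below uses the standard forward algorithm to compute how probable a label sequence is under a given hidden Markov model. The two-state models, the label names (`t1`, `t2`, `a1`, `a2`), and every probability are invented for illustration; a real system would estimate them from the training script.

```python
def forward_likelihood(model, labels):
    """Forward algorithm: P(labels | model), summed over all state paths."""
    states = range(len(model["start"]))
    alpha = [model["start"][s] * model["emit"][s].get(labels[0], 0.0)
             for s in states]
    for label in labels[1:]:
        alpha = [
            sum(alpha[p] * model["trans"][p][s] for p in states)
            * model["emit"][s].get(label, 0.0)
            for s in states
        ]
    return sum(alpha)

# Hypothetical left-to-right models: state 0 may stay put or move to state 1.
TRANSITIONS = [[0.5, 0.5], [0.0, 1.0]]

T_MODEL = {
    "start": [1.0, 0.0],
    "trans": TRANSITIONS,
    "emit": [{"t1": 0.8, "t2": 0.1, "a1": 0.1},
             {"t1": 0.1, "t2": 0.8, "a1": 0.1}],
}
A_MODEL = {
    "start": [1.0, 0.0],
    "trans": TRANSITIONS,
    "emit": [{"a1": 0.8, "a2": 0.1, "t1": 0.1},
             {"a1": 0.1, "a2": 0.8, "t2": 0.1}],
}

observed = ["t1", "t1", "t2"]
t_score = forward_likelihood(T_MODEL, observed)
a_score = forward_likelihood(A_MODEL, observed)
```

With these made-up numbers, `t_score` comes out several hundred times larger than `a_score`, so the "t" model is by far the better explanation of the observed labels.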
The next step, acoustic matching, compares the extracted labels to the acoustic models in the dictionary. Every word in the dictionary is broken down into these subphonetic labels, so the labels generated through acoustic processing can be matched to the dictionary entries.
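A rough way to picture this matching step is to score each dictionary entry by how closely its label sequence resembles the labels that were extracted. The sketch below substitutes a simple edit distance for the statistical scoring a real recognizer would use, so it is a stand-in technique, not the IPDS method. The three-word dictionary and its label spellings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # drop a label from a
                            curr[j - 1] + 1,           # insert a label from b
                            prev[j - 1] + (x != y)))   # match or substitute
        prev = curr
    return prev[-1]

# Hypothetical dictionary: every word is stored as a sequence of labels.
DICTIONARY = {
    "tan": ["t1", "t2", "a1", "a2", "n1"],
    "can": ["k1", "k2", "a1", "a2", "n1"],
    "ten": ["t1", "t2", "e1", "e2", "n1"],
}

def rank_candidates(labels, dictionary):
    """Order dictionary words from best to worst acoustic match."""
    return sorted(dictionary,
                  key=lambda word: edit_distance(labels, dictionary[word]))

# Labels as they might come out of acoustic processing (one label is off).
extracted = ["t1", "t2", "a1", "a1", "n1"]
ranking = rank_candidates(extracted, DICTIONARY)
```

Here `ranking` puts "tan" first, "ten" second, and "can" last, because the extracted labels differ from those entries in one, two, and three positions, respectively.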
The system does not decide on the best word based on acoustic matching alone. It also employs an adaptive language model to enhance recognition accuracy. The language model is based on unigrams (single words), bigrams (sets of two words), and trigrams (sets of three words). The model maintains data on word usage and knows the probability that any single word or set of words will be used.
For instance, there is a relatively high probability that the word "the" will be spoken, and a lower probability that the word "creed" will be spoken. The system then looks at a pair of words and determines the probability that a particular pair of words will appear together. Next, it considers a set of three words and checks its probability data again. The system constantly refines its recognition of a particular word by looking ahead and back. As you dictate, you can watch the system dynamically alter its word guesses as the frame of reference around that word expands.
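The unigram, bigram, and trigram probabilities described above can be estimated by simple counting. The following sketch does so over a twelve-word sample text; the text, the function names, and the interpolation weights that blend the three estimates are assumptions made for the example and say nothing about how IPDS stores or combines its statistics.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every run of n consecutive words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Hypothetical sample of previously seen text.
corpus = "the cat sat on the mat and the cat ate the fish".split()
uni = ngram_counts(corpus, 1)
bi = ngram_counts(corpus, 2)
tri = ngram_counts(corpus, 3)

def unigram_prob(word):
    """How often the word occurs on its own."""
    return uni[(word,)] / len(corpus)

def bigram_prob(prev, word):
    """How often the word follows the previous word."""
    seen = uni[(prev,)]
    return bi[(prev, word)] / seen if seen else 0.0

def trigram_prob(prev2, prev1, word):
    """How often the word follows the previous two words."""
    seen = bi[(prev2, prev1)]
    return tri[(prev2, prev1, word)] / seen if seen else 0.0

def interpolated_prob(prev2, prev1, word, weights=(0.2, 0.3, 0.5)):
    """Blend the three estimates; the weights are illustrative only."""
    w1, w2, w3 = weights
    return (w1 * unigram_prob(word)
            + w2 * bigram_prob(prev1, word)
            + w3 * trigram_prob(prev2, prev1, word))

p_the = unigram_prob("the")
p_creed = unigram_prob("creed")
p_cat_after_the = bigram_prob("the", "cat")
p_sat_after_the_cat = interpolated_prob("the", "cat", "sat")
```

In this tiny sample, "the" accounts for a third of all words while "creed" never appears, which mirrors the contrast drawn in the text. Widening the context from one word to three is what lets the estimate for a word change as more of its neighbors become known.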
The last step of the process, the hypothesis search, combines the results of both the acoustic matching and the language model to determine the most probable word string.
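One common way to carry out such a search is a Viterbi-style pass that multiplies acoustic and language-model scores (added here as logarithms) and keeps the best word string ending in each candidate. The sketch below assumes that approach; the source does not say which search IPDS uses, and the candidate words, acoustic scores, and bigram probabilities are all invented for illustration.

```python
import math

# Hypothetical acoustic scores: for each spoken word, candidate -> probability.
ACOUSTIC = [
    {"right": 0.55, "write": 0.45},
    {"a": 0.7, "the": 0.3},
    {"letter": 0.6, "ladder": 0.4},
]

# Hypothetical bigram language model; "<s>" marks the start of the utterance.
BIGRAM = {
    ("<s>", "write"): 0.3, ("<s>", "right"): 0.2,
    ("write", "a"): 0.4, ("write", "the"): 0.2,
    ("right", "a"): 0.1, ("right", "the"): 0.3,
    ("a", "letter"): 0.3, ("a", "ladder"): 0.1,
    ("the", "letter"): 0.2, ("the", "ladder"): 0.2,
}

def hypothesis_search(acoustic, bigram, floor=1e-6):
    """Return the word string with the highest combined score."""
    # best[word] = (log score of the best string ending in word, that string)
    best = {"<s>": (0.0, [])}
    for candidates in acoustic:
        step = {}
        for word, p_acoustic in candidates.items():
            step[word] = max(
                (score
                 + math.log(bigram.get((prev, word), floor))  # language model
                 + math.log(p_acoustic),                      # acoustic match
                 path + [word])
                for prev, (score, path) in best.items()
            )
        best = step
    return max(best.values())[1]

acoustic_only = [max(c, key=c.get) for c in ACOUSTIC]
combined = hypothesis_search(ACOUSTIC, BIGRAM)
```

Judged on sound alone, the first word comes out as "right." Once the language model is factored in, the search settles on "write a letter," because that string is far more probable as a whole.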
In addition to adding new words to the dictionary as you specify them, the system updates the probability models to reflect your unique word-usage patterns. This adaptive process allows the system to become more accurate as you use it. It also explains why the system works better with documents that share consistent terminology and phraseology: It can better predict what words you are likely to say if you follow consistent patterns of word usage.
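The effect of this adaptation can be imitated with a model that simply keeps counting as you dictate. The class below is a hypothetical sketch: its name, its methods, and its add-one smoothing are assumptions chosen to keep the example short, not a description of how IPDS updates its probability models.

```python
from collections import Counter

class AdaptiveBigramModel:
    """Toy word-pair model that updates itself from dictated text."""

    def __init__(self):
        self.vocabulary = set()
        self.context_counts = Counter()
        self.pair_counts = Counter()

    def add_word(self, word):
        """Add a user-specified word to the dictionary."""
        self.vocabulary.add(word)

    def observe(self, words):
        """Update the counts with a newly dictated word sequence."""
        self.vocabulary.update(words)
        for prev, word in zip(words, words[1:]):
            self.context_counts[prev] += 1
            self.pair_counts[(prev, word)] += 1

    def probability(self, prev, word):
        """Add-one smoothed estimate of P(word | prev)."""
        if not self.vocabulary:
            return 0.0
        return ((self.pair_counts[(prev, word)] + 1)
                / (self.context_counts[prev] + len(self.vocabulary)))

model = AdaptiveBigramModel()
model.observe("the patient reports mild pain".split())
model.add_word("presents")
before = model.probability("patient", "presents")

# The user keeps dictating documents that share the same phrasing.
for _ in range(3):
    model.observe("the patient presents with fever".split())
after = model.probability("patient", "presents")
```

After three more sentences with the same phrasing, `after` is more than twice `before`: the more consistently a phrase is used, the more strongly the model expects it.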