also on the speaker's emotional situation, making speech recognition a complicated task for developers.
Continuous speech dictation software has been available for the last 18 months but has been limited to around 25,000 words and profession-specific vocabularies, such as for radiologists (e.g., IBM's MedSpeak). Now continuous systems such as Dragon's NaturallySpeaking are starting to replace existing systems that usually work only with discrete speech punctuated with pauses or that are limited in vocabulary.
Simply put, these new systems do the same as humans do, albeit primitively. They separate speech into words or phonemes (the basic building blocks of speech), compare the acoustic patterns of the speech with the patterns stored in a database, and find the most likely word.
General-purpose dictation software such as Dragon Dictate, IBM VoiceType, and Kurzweil AI Voice typically comes with a vocabulary of up to 60,000 words and the ability to add new words. However, these packages cannot cope with input at natural talking speeds (they are limited to about 100 words per minute), and they require the user to punctuate sentences with short pauses. They usually "understand" straight out of the box, although they work better when given the chance to adapt to a regular user's speech patterns and learn frequently used words.
Today's dictation software, when adapted to a user's speaking characteristics and optimized for certain contexts, achieves around 95 percent accuracy. However, the ultimate aim is for all systems to be speaker-independent and multilingual.
Dragon's new NaturallySpeaking, one of the first general-purpose continuous speech dictation packages, is an example of software that heralds the next generation in computer dictation. Although the first version doesn't allow the speaker to dictate into other applications, you can paste recognized text into other software. Also, it does not include the command-and-control features that come with some discrete dictation packages for vocally navigating around the computer, opening and closing applications, and even surfing the Net hands-free.
Processing natural speech eats up a lot of computing power, and this is one reason why it has taken until now for viable commercial products to hit the shelves. "When our first system was developed in 1993, the processor power of a PC was not sufficient to run natural continuous speech recognition," says Ralph Preclik, communications director of Philips Speech Processing. "We had to develop a dedicated accelerator board at that time."
With the introduction of Pentium Pro and MMX technology, speech recognition applications are now running straight from the CPU without a dedicated DSP to perform the signal-processing analysis. According to Preclik, the bottlenecks in speech recognition are now related to other factors; for example, the insufficient display speed of word processing applications.
Most new (and also many earlier) speech recognition applications require not only high computing power but also a minimum of 32 MB of RAM. However, in an embedded-system environment such as a mobile phone, algorithms have to get along with much reduced system resources and perform one- or two-word recognition at best.
Speech recognition algorithms that identify your utterances as a sequence of whole words are usually very fast, but they require more training and greater processing power. They are therefore best suited to small-vocabulary applications such as command/control or hands-free phone dialing.
On the other hand, algorithms that recognize phonemes, the basic building blocks of spoken language, are usually more compact and flexible. Phoneme-based algorithms allow for the addition of new words to a vocabulary by identifying and combining existing phonemes. (Most languages have between 30 and 60 phonemes, so the number of combinations is huge but manageable.)
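The idea of extending a vocabulary by recombining existing phonemes can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name PhonemeLexicon, the tiny ARPAbet-style inventory, and the pronunciations are all assumptions made for the example.

```python
# Hypothetical sketch: a lexicon in which every word is a sequence of
# phonemes drawn from a fixed inventory. The symbols are ARPAbet-style and
# cover only a small illustrative subset of a real language's phonemes.
PHONEMES = {"K", "AE", "T", "D", "AO", "G", "B", "IH", "S"}


class PhonemeLexicon:
    def __init__(self, inventory):
        self.inventory = set(inventory)
        self.words = {}  # word -> tuple of phonemes

    def add_word(self, word, pronunciation):
        """Add a word by combining phonemes that are already in the inventory."""
        unknown = [p for p in pronunciation if p not in self.inventory]
        if unknown:
            raise ValueError(f"unknown phonemes for {word!r}: {unknown}")
        self.words[word] = tuple(pronunciation)

    def lookup(self, phonemes):
        """Return every word whose pronunciation matches the sequence exactly."""
        target = tuple(phonemes)
        return sorted(w for w, p in self.words.items() if p == target)


lexicon = PhonemeLexicon(PHONEMES)
lexicon.add_word("cat", ["K", "AE", "T"])
lexicon.add_word("dog", ["D", "AO", "G"])
lexicon.add_word("bit", ["B", "IH", "T"])
```

Adding a word needs no new acoustic training in this sketch: as long as its sounds are already in the inventory, the word is just another entry in the table.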
An automated directory-inquiry system, which can retrieve, for example, a name without linguistic context, is a typical application of phoneme-based algorithms. Phonetic Systems' Phonetic Database Server, for example, uses such algorithms for speech recognition and rapid searching of very large databases. It can currently handle databases containing 100,000 names, but the company aims to have search capabilities of one million entries by the middle of next year.
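How a name might be retrieved from its sounds alone can be illustrated with a simple fuzzy match over phoneme sequences. The sketch below uses plain edit distance, which is an assumption chosen for clarity and not a description of how the Phonetic Database Server works; the directory entries, pronunciations, and function names are invented for the example.

```python
# Hypothetical sketch: find the directory names whose phoneme sequences are
# closest to what was heard, using Levenshtein (edit) distance. A linear
# scan like this would not scale to very large databases.

def edit_distance(a, b):
    """Number of phoneme insertions, deletions, or substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def closest_names(heard, directory):
    """Return (distance, name) pairs sorted from best to worst match."""
    return sorted((edit_distance(heard, phones), name)
                  for name, phones in directory.items())


# Invented directory with ARPAbet-style pronunciations.
DIRECTORY = {
    "Smith": ("S", "M", "IH", "TH"),
    "Smythe": ("S", "M", "AY", "DH"),
    "Schmidt": ("SH", "M", "IH", "T"),
}

# The recognizer misheard the final sound of "Smith" as "F".
matches = closest_names(("S", "M", "IH", "F"), DIRECTORY)
best_name = matches[0][1]
```

Because no sentence context is available in a directory inquiry, the ranking rests entirely on how close the sounds are.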
Both types of algorithms reinterpret the signal phonetically and match it against a database of acoustic samples by allotting probability scores to possible word matches. Hidden Markov Modeling, based on a two-stage probabilistic process, is currently the most popular statistical modeling technique used for allotting such scores. Alternative models that use neural networks do not perform as well as Hidden Markov Models (HMMs). Says Philips' Ralph Preclik, "Today neural nets can gain acceptable performance only in combination with HMMs."
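To make the scoring step concrete, here is a minimal sketch of how an HMM can allot a probability to each candidate word. It uses the standard forward algorithm on discrete observations; the two word models, their transition and emission tables, and the quantized "acoustic symbols" 0, 1, and 2 are toy assumptions, far smaller than anything a commercial engine would use.

```python
# Hypothetical sketch: score an observation sequence against one small HMM
# per word and pick the word whose model gives the highest likelihood.
# The "two stages" are the hidden state transitions and the symbols that
# each state emits.

def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete HMM via the forward algorithm."""
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n = len(start)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


# Toy left-to-right models with two states each (all values are invented).
TRANS = [[0.6, 0.4],
         [0.0, 1.0]]
MODELS = {
    # "yes" tends to emit symbol 0 first, then symbol 2.
    "yes": ([1.0, 0.0], TRANS, [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]),
    # "no" tends to emit symbol 1 first, then symbol 0.
    "no": ([1.0, 0.0], TRANS, [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]),
}


def recognize(obs, models):
    """Return (best_word, scores), where scores maps word -> likelihood."""
    scores = {w: forward_likelihood(obs, *m) for w, m in models.items()}
    return max(scores, key=scores.get), scores


word, scores = recognize([0, 0, 2, 2], MODELS)
```

Real engines work with log probabilities and continuous acoustic features to avoid numerical underflow, but the principle of comparing per-word likelihoods is the same.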
Acoustic matching produces the most likely phonemes or words, but this is not the end. Words can be spoken in different ways, at different speeds, so intelligence is needed to make the leap from a combination of phonemes into actual words or sentences. This process is called linguistic matching. The speech engine then emerges with what it considers to be the most likely word that was spoken.
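One common way to implement this kind of linguistic matching is to weigh each acoustic candidate by how likely it is to follow the previous word. The sketch below assumes a bigram language model, a technique chosen here for illustration; the probabilities, the function name pick_word, and the example words are invented.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); invented values.
BIGRAMS = {
    ("want", "to"): 0.30,
    ("want", "two"): 0.05,
    ("want", "too"): 0.01,
    ("number", "two"): 0.20,
    ("number", "to"): 0.02,
}


def pick_word(prev_word, acoustic_scores, bigrams, floor=1e-6):
    """Combine acoustic and linguistic evidence and return the best word.

    acoustic_scores maps each candidate word to the probability assigned by
    acoustic matching. Unseen word pairs fall back to a small floor value.
    """
    def score(word):
        language = bigrams.get((prev_word, word), floor)
        return math.log(acoustic_scores[word]) + math.log(language)

    return max(acoustic_scores, key=score)


# Three homophones that acoustic matching alone cannot tell apart.
candidates = {"to": 0.33, "two": 0.34, "too": 0.33}
after_want = pick_word("want", candidates, BIGRAMS)
after_number = pick_word("number", candidates, BIGRAMS)
```

The acoustic scores are nearly tied in this example, so the surrounding words decide the outcome, which is the leap from sounds to sentences that the paragraph above describes.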
Multiple Languages
The Holy Grail of computer linguists is a language-independent, speaker-independent, continuous speech recognition interface. Lernout & Hauspie's Language Factory, a software development kit (SDK), helps developers move closer to this paradigm. This suite of multilingual, speaker-independent technologies -- which includes components for automatic speech recognition, text-to-speech conversion, translation, and digital speech compression -- is tailored to small and medium vocabularies. It has already been implemented in a variety of areas, such as language-learning software, voice verification systems, and car navigation.
Lernout & Hauspie's SDK probably has the widest range of supported languages. Its products are available in U.S. English, U.K. English, French, German, Italian, Cantonese, Dutch, Korean, Malay, and Spanish. Japanese, Mandarin, Portuguese, and Russian versions are currently under development.
Building a speech recognition engine in multiple languages requires a lot of resources because you need to collect a large database of speech samples first, including all accents, dialects, and the unique sounds in that language. "This is quite a lengthy process, not least because you must have recordings of several hundred speakers to be able to produce a good model," says Richard Winski, manager of language resources and technology at Vocalis Group. "With access to a suitable database, however, you can normally add a new language in a few weeks."
Each new language presents a unique challenge. "You have to devote a lot of resources to the peculiarities of each language," says Hunt of Dragon Systems U.K. English, for example, is difficult to synthesize because pronunciation is not always obvious from the way a word is spelled. French, on the other hand, is more difficult to recognize. The French verb appeler (to call), for example, can be spelled 12 different ways yet pronounced identically. In German, compound words are difficult to deal with, and the various Chinese dialects differ largely in tone, which isn't an issue in European languages. A case in point is the Chinese word ma, which can have five different meanings, depending on intonation.
One of the first companies to rise to the Chinese language challenge was Motorola's Lexicus division. Discrete speech recognition software has been very difficult for Chinese because word boundaries are sometimes ambiguous. As a result, speedy recognition in Chinese wasn't possible until continuous systems worked well enough. Motorola's Chinese continuous speech recognition engine, released late last year, can now recognize over 10,000 spoken words running on a standard PC. That's good news for the 20 percent of the world's population that speaks Chinese.
Where to Find
Dragon Systems
Bishops Cleeve, Cheltenham, U.K.
Phone: +44-1242-678-575
Fax: +44-1242-678-301
E-mail: info@dragonsys.com
Internet: http://www.dragonsys.com