Speech Recognition Systems

This article was published by ComputorEdge, issue #2202, as a feature article, in both their print edition (on pages 18-19) and on their website.

There are numerous methods for entering information into a computer — keyboards and pointing devices (mice, touch pads, etc.) being the most prevalent. But the efficiency of these methods has always been limited by the skill and speed of the individual computer user. The situation is even more daunting for those people who do not have the use of their fingers for typing or their hands for manipulating mice. A better approach would be for the user to simply talk to the computer, and have it understand what information the user is attempting to enter and what operations the user is trying to perform. But computers per se are not capable of doing that.

During the past few decades, the idea of automatic speech recognition (ASR) has gradually become more of a reality, and no longer just an intriguing but fanciful idea employed by science fiction authors. Those writers envisioned the advantages of humans being able to communicate with their computers and robots, simply through the use of natural language commands. In fact, the public's exposure to the idea was no doubt mostly a result of it being used successfully in movies (such as "2001: A Space Odyssey") and television shows (such as the original "Star Trek" series).

"Hello, Computer"

Like so many other technological advancements, speech recognition was first dreamed up by science fiction writers, and only later began to be implemented in reality through academic and military research. The very first work in the field of speech recognition was done at AT&T's Bell Labs, and sponsored by the U.S. Department of Defense, during the 1930s and 1940s. The researchers attempted to develop automatic language translators to expedite the processing of intercepted Soviet messages. Despite the overall failure of this particular project, these early efforts illustrated just how difficult it is to program a computer to automatically translate the phonemes of human speech into the correct words intended by the speaker. This task is made more difficult by the huge vocabulary of the typical computer user, by every utterance of a word sounding somewhat different from all previous utterances of it, and by unavoidable background noise.

By the early 1980s, the technology had progressed to the point where it was ready for the commercial world. Products were brought to market by several companies, including Covox and Dragon Systems. These early attempts at consumer speech recognition systems required that the user employ only "discrete speech", with each word separated by a pause, in order to make it much easier for the system to distinguish individual words. In terms of accuracy and thus usability, these admittedly pioneering systems were only marginally useful. This was partly due to the hardware limitations of that time, primarily inadequate processing power and insufficient system memory.

One solution was to limit the user to a smaller vocabulary than what is used in everyday conversation. This made it easier for the speech software to correctly identify the individual words spoken. For instance, some of the earliest such applications were for medical transcription and automated telephone response. Examples of the latter include today's (unpopular) phone menus, in addition to trade entry systems used by customers of brokerages. In fact, Charles Schwab was the first major consumer company to make use of speech recognition, in the form of a "telephone broker", in 1996.
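To see why a small vocabulary helps, consider the following minimal sketch in Python. It is only an illustration, not how any of the products mentioned here actually work: the command list is hypothetical, and similarity between spelled-out words stands in for similarity between sounds, which is an assumption made purely to keep the example short. The point is that with only a handful of permitted words, even a badly "heard" word usually has just one plausible match.

```python
import difflib

# Hypothetical command vocabulary for a telephone brokerage menu
# (invented for illustration; not taken from any real system).
VOCABULARY = ["buy", "sell", "quote", "balance", "operator"]


def match_command(heard, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the vocabulary word closest to what was heard, or None.

    Text similarity is used here as a stand-in for acoustic similarity.
    """
    matches = difflib.get_close_matches(
        heard.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    for heard in ["ballance", "quot", "xyzzy"]:
        print(heard, "->", match_command(heard))
```

With a vocabulary of tens of thousands of words, the same garbled input would have many near neighbors, and the system would need far more context to choose among them.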

But as the power of personal computers increased dramatically, so did the dictation capabilities of the newer speech systems. A real breakthrough occurred in 1997, when Dragon Systems introduced NaturallySpeaking, a "continuous speech" recognition program for general-purpose use, with a vocabulary of 23,000 words. This allowed users to dictate to their computers without having to pause between words. IBM responded quickly with their own product, ViaVoice.

Nowadays, state-of-the-art speech recognition systems are able to achieve anywhere from 95 to 99 percent accuracy. Top performance is best achieved through a combination of robust hardware (at a minimum for a PC, a Pentium III with 512 MB of RAM), a leading speech product, a quality microphone, clear enunciation of words by the user, and training of the system after installation. Fortunately, the training process, which used to take over an hour, can now be done in 10 minutes or less.
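Those accuracy percentages are easier to appreciate as a count of mistakes to be corrected. The short Python sketch below does the arithmetic, assuming a dictated page of 500 words (a figure chosen only for illustration) and treating accuracy as the fraction of words recognized correctly.

```python
def expected_errors(word_count, accuracy):
    """Estimate how many words are misrecognized at a given accuracy.

    accuracy is a fraction between 0 and 1 (0.95 means 95 percent).
    """
    return round(word_count * (1.0 - accuracy))


if __name__ == "__main__":
    page = 500  # assumed page length in words, for illustration only
    for accuracy in (0.95, 0.99):
        print(f"{accuracy:.0%} accuracy: about "
              f"{expected_errors(page, accuracy)} errors per {page} words")
```

Under that assumption, the difference between 95 and 99 percent is the difference between roughly 25 corrections per page and roughly 5.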

"Begin Dictation"

If you have never used speech recognition software before, or if you tried it in the past but were dissatisfied with the results, now is the ideal time to give it a try. Not only are typical modern computer systems more than adequate for performing the intensive voice processing, but the speech products themselves perform dramatically better than their predecessors. In addition, all of the major speech systems allow the user to dictate directly into Microsoft Office products, record and program (time-saving) macros, edit and format text by voice, have the computer read text out loud ("text-to-speech"), teach the system new words and phrases, and use the voice to manipulate menus, dialog boxes, and the mouse pointer.

Speech recognition products are available primarily for Microsoft Windows and Apple Macintosh machines. IBM's ViaVoice is available for Windows, Mac, and handheld computer platforms. However, Windows 2000 and XP Professional users are limited to the more expensive Pro USB Edition. Likewise, ScanSoft's Dragon NaturallySpeaking is supported on Windows and Mac, but not Linux or Unix.

The most popular product is also the one that I personally use and recommend: Dragon NaturallySpeaking. It presently comes in nine editions, some geared towards specific business fields, such as the medical and legal professions. But if you are purchasing NaturallySpeaking for home use, and do not have unlimited financial resources, you may wish to start with their Essentials Edition, which provides the bulk of the dictation capabilities, without extra features that most users would never use.

If you are not entirely comfortable with learning new software systems, and you foresee making extensive use of dictation, then you should consider getting some training from an ergonomic consultant. They can help you install the product, train the system to better recognize your speech patterns, and show you how to make full use of the product features, including the macros. Regardless of whether you get help learning the system, or teach yourself using the manual, you are strongly urged to fully learn the capabilities of your new dictation system — in order to make the best use of it, and to minimize frustration. Also keep in mind that the recognition accuracy will improve the more you use the system.

As you can imagine, dictation software has been a real godsend to physically handicapped people who otherwise would not be able to control various electronic devices, including their computers. Corporate workers can use these dictation systems for creating the bulk of their business documents, including email messages. Home users can easily create email messages and even do instant messaging (IM) purely by voice. These dictation systems are also much appreciated by writers, who can now turn "stream of consciousness" into stream of text, without touching the keyboard or mouse. For example, this entire article has been dictated directly into the simple DragonPad word processor. Thus, if you find any errors in this article, feel free to blame my computer!

Copyright © 2003 Michael J. Ross. All rights reserved.