Interacting with Computers by Voice: Automatic
Speech Recognition and Synthesis
DOUGLAS O’SHAUGHNESSY, SENIOR MEMBER, IEEE
Invited Paper
This paper examines how to communicate with computers using speech. Automatic speech recognition (ASR) transforms speech into text, while automatic speech synthesis [or text-to-speech (TTS)] performs the reverse task. ASR has largely developed based on speech coding theory, while simulating certain spectral analyses performed by the ear. Typically, a Fourier transform is employed, but following the auditory Bark scale and simplifying the spectral representation with a decorrelation into cepstral coefficients. Current ASR provides good accuracy and performance on limited practical tasks, but exploits only the most rudimentary knowledge about human production and perception phenomena. The popular mathematical model called the hidden Markov model (HMM) is examined; first-order HMMs are efficient but ignore long-range correlations in actual speech. Common language models use a time window of three successive words in their syntactic–semantic analysis.

Speech synthesis is the automatic generation of a speech waveform, typically from an input text. As with ASR, TTS starts

…audition, smell, and touch. To communicate with our environment, we send out signals or information visually, auditorily, and through gestures. The primary means of communication are visual and auditory. Computer interactions often use a mouse and keyboard as machine input, and a computer screen or printer as output. Speech, however, has always had a high priority in communication, developed long before writing. In terms of efficiency of communication bandwidth, speech pales before images in any quantitative measure; e.g., one can read text and understand images much more quickly on a two-dimensional (2-D) computer screen than when listening to a [one-dimensional (1-D)] speech signal. However, most people can speak more quickly than they can type, and are much more comfortable speaking than typing.