Document name:

IEEE - Interacting with Computers by Voice- Automatic Speech Recognition and Synthesis.pdf

Format: PDF   Pages: 34

Uploaded by: kuo08091   2014/1/13   File size: 0 KB

Document description

Interacting with Computers by Voice: Automatic
Speech Recognition and Synthesis
DOUGLAS O’SHAUGHNESSY, SENIOR MEMBER, IEEE
Invited Paper

... audition, smell, and touch. To communicate with our environment, we send out
signals or information visually, auditorily, and through gestures. The primary
means of communication are visual and auditory. Human-computer interactions often
use a mouse and keyboard as machine input, and a computer screen or printer as
output. Speech, however, has always had a high priority in human communication,
developed long before writing. In terms of efficiency of communication bandwidth,
speech pales before images in any quantitative measure; i.e., one can read text
and understand images much more quickly on a two-dimensional (2-D) computer screen
than when listening to a [one-dimensional (1-D)] speech signal. However, most
people can speak more quickly than they can type, and are much more comfortable
speaking than typing.

This paper examines how people communicate with computers using speech. Automatic
speech recognition (ASR) transforms speech into text, while automatic speech
synthesis [or text-to-speech (TTS)] performs the reverse task. ASR has largely
developed based on speech coding theory, while simulating certain spectral
analyses performed by the ear. Typically, a Fourier transform is employed, but
following the auditory Bark scale and simplifying the spectral representation
with a decorrelation into cepstral coefficients. Current ASR provides good
accuracy and performance on limited practical tasks, but exploits only the most
rudimentary knowledge about human production and perception phenomena. The
popular mathematical model called the hidden Markov model (HMM) is examined;
first-order HMMs are efficient but ignore long-range correlations in actual
speech. Common language models use a time window of three successive words in
their syntactic–semantic analysis.
Speech synthesis is the automatic generation of a speech waveform, typically
from an input text. As with ASR, TTS starts