Speech Input and Output

When a user speaks to a conversational interface, the system must recognize what was said. The automatic speech recognition (ASR) component processes the acoustic signal representing the spoken utterance and outputs a sequence of word hypotheses, thus transforming speech into text. The other side of the coin is text-to-speech synthesis (TTS), in which written text is transformed into speech. Both areas have been the subject of extensive research, and striking improvements have been made over the past decade. In this chapter, we provide an overview of the processes of ASR and TTS.
