Developments and Directions in Speech Recognition and Understanding, Part 1

To advance research, it is important to identify promising future research directions, especially those that have not been adequately pursued or funded in the past. The working group producing this article was charged to elicit from the human language technology (HLT) community a set of well-considered directions or rich areas for future research that could lead to major paradigm shifts in the field of automatic speech recognition (ASR) and understanding. ASR has been an area of great interest and activity to the signal processing and HLT communities over the past several decades. As a first step, this group reviewed major developments in the field and the circumstances that led to their success and then focused on areas it deemed especially fertile for future research.

Part 1 of this article will focus on historically significant developments in the ASR area, including several major research efforts that were guided by different funding agencies, and suggest general areas in which to focus research. Part 2 (to appear in the next issue) will explore in more detail several new avenues holding promise for substantial improvements in ASR performance. These entail cross-disciplinary research and specific approaches to address three-to-five-year grand challenges aimed at stimulating advanced research by dealing with realistic tasks of broad interest.

SIGNIFICANT DEVELOPMENTS IN SPEECH RECOGNITION AND UNDERSTANDING

The period since the mid-1970s has witnessed the multidisciplinary field of ASR proceed from its infancy to its coming of age and into a quickly growing number of practical applications and commercial markets. Despite its many achievements, however, ASR remains far from being a solved problem. As in the past, we expect that further research and development will enable us to create increasingly powerful systems, deployable on a worldwide basis.
This section briefly reviews highlights of major developments in ASR in five areas: infrastructure, knowledge representation, models and algorithms, search, and metadata. Broader and deeper discussions of these areas can …

INFRASTRUCTURE

Moore's Law observes long-term progress in computer development and predicts doubling the amount of computation achievable for a given cost every 12 to 18 months, as well as a comparably shrinking cost of memory. These developments have been instrumental in enabling ASR researchers to run increasingly complex algorithms in sufficiently short time frames (e.g., meaningful experiments that can be done in less than a day) to make great progress since 1975. The availability of common speech corpora for speech training, …
