Large vocabulary speech recognition using subword units

Abstract: Research in large vocabulary speech recognition has been carried out intensively worldwide over the past several years, spurred on by advances in algorithms, architectures, and hardware. In the United States, the DARPA community has focused its efforts on several continuous speech recognition tasks: Naval Resource Management, a 991-word task; ATIS (Air Travel Information System), a speech understanding task with an open vocabulary (in practice on the order of several thousand words) and a natural language component; and Wall Street Journal, a voice dictation task with a vocabulary on the order of 20,000 words. Although we have learned a great deal about how to build and efficiently implement large vocabulary speech recognition systems, there remains a whole range of fundamental questions for which we have no definitive answers. In this paper we review the basic structure of a large vocabulary speech recognition system and address the basic system design issues. We discuss the considerations in the selection of training material, the choice of subword unit, the method of training and adapting models of subword units, the integration of the language model, and the implementation of the overall system. Finally, we report on some recent results, obtained at AT&T Bell Laboratories, on the Resource Management task.
