From Sphinx-II to Whisper: Making Speech Recognition Usable

In this chapter, we first review Sphinx-II, a large-vocabulary, speaker-independent, continuous speech recognition system developed at Carnegie Mellon University, and summarize the techniques that enabled it to achieve state-of-the-art recognition performance. We then review Whisper, a system we developed at Microsoft Corporation, with a focus on recognition accuracy, efficiency, and usability, the three issues most critical to the success of commercial speech applications. Whisper offers significantly improved performance in all three areas, and it can be configured either as a spoken language front end (for telephony or the desktop) or as a dictation application.
