High-Accuracy Large-Vocabulary Speech Recognition Using Mixture Tying and Consistency Modeling

Improved acoustic modeling can significantly decrease the error rate in large-vocabulary speech recognition. Our approach to the problem is twofold. We first propose a scheme that optimizes the degree of mixture tying for a given amount of training data and computational resources. Experimental results on the Wall Street Journal (WSJ) Corpus show that this new form of output distribution achieves a 25% reduction in error rate over typical tied-mixture systems. We then show that an additional improvement can be achieved by modeling local time correlation with linear discriminant features.

[1]  Jonathan G. Fiscus,et al.  Benchmark Tests for the DARPA Spoken Language Program , 1993, HLT.

[2]  H. Ney,et al.  Linear discriminant analysis for improved large vocabulary continuous speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3]  Vassilios Digalakis,et al.  Genones: optimizing the degree of mixture tying in a large vocabulary hidden Markov model based speech recognizer , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  C. J. Wellekens,et al.  Explicit time correlation in hidden Markov models for speech recognition , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Chin-Hui Lee,et al.  Acoustic modeling for large vocabulary speech recognition , 1990 .

[6]  Xuedong Huang,et al.  Performance comparison between semicontinuous and discrete hidden Markov models of speech , 1988 .

[7]  Mari Ostendorf,et al.  Maximum likelihood clustering of Gaussians for speech recognition , 1994, IEEE Trans. Speech Audio Process..

[8]  L. R. Rabiner,et al.  Recognition of isolated digits using hidden Markov models with continuous mixture densities , 1985, AT&T Technical Journal.

[9]  Mari Ostendorf,et al.  On the Use of Tied-Mixture Distributions , 1993, HLT.

[10]  Vassilios Digalakis,et al.  Techniques to Achieve an Accurate Real-Time Large-Vocabulary Speech Recognition System , 1994, HLT.

[11]  Michael Picheny,et al.  Context Dependent Modeling of Phones in Continuous Speech Using Decision Trees , 1991, HLT.

[12]  Jonathan G. Fiscus,et al.  1993 Benchmark Tests for the ARPA Spoken Language Program , 1994, HLT.

[13]  D. B. Paul,et al.  Speaker stress-resistant continuous speech recognition , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[14]  George R. Doddington Phonetically sensitive discriminants for improved speech recognition , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[15]  Kai-Fu Lee,et al.  Context-independent phonetic hidden Markov models for speaker-independent continuous speech recognition , 1990 .

[16]  Mari Ostendorf,et al.  ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition , 1993, IEEE Trans. Speech Audio Process..

[17]  Mei-Yuh Hwang,et al.  Subphonetic modeling with Markov states-Senone , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[18]  Steve Young,et al.  The general use of tying in phoneme-based HMM speech recognisers , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  B. Juang,et al.  Context-dependent Phonetic Hidden Markov Models for Speaker-independent Continuous Speech Recognition , 2008 .

[20]  Mitch Weintraub,et al.  Large-vocabulary dictation using SRI's DECIPHER speech recognition system: progressive search techniques , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[21]  Jerome R. Bellegarda,et al.  Tied mixture continuous parameter modeling for speech recognition , 1990, IEEE Trans. Acoust. Speech Signal Process..

[22]  S. Furui On the role of spectral transition for speech perception. , 1986, The Journal of the Acoustical Society of America.

[23]  D. B. Paul,et al.  The Lincoln robust continuous speech recognizer , 1989, International Conference on Acoustics, Speech, and Signal Processing,.