Phonemic hidden Markov models with continuous mixture output densities for large vocabulary word recognition

The authors demonstrate the effectiveness of phonemic hidden Markov models with Gaussian mixture output densities (mixture HMMs) for speaker-dependent large-vocabulary word recognition. Speech recognition experiments show that, for almost any reasonable amount of training data, recognizers using mixture HMMs consistently outperform those employing unimodal Gaussian HMMs. With a sufficiently large training set (e.g., more than 2500 words), HMMs with 25-component mixture distributions typically reduce recognition errors by about 40%. The mixture HMMs are also found to outperform a set of unimodal generalized triphone models having the same number of parameters. Previous attempts to employ mixture HMMs for speech recognition proved discouraging because of the high complexity and computational cost of implementing the Baum-Welch training algorithm. It is shown how mixture HMMs can be implemented very simply in unimodal transition-based frameworks by allowing multiple transitions from one state to another.
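The multiple-transition idea in the last sentence can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scalar observations, densities attached to transitions (arcs), and a toy two-state model whose numbers are invented for illustration. All names (`gauss`, `expand_mixture_arcs`, `forward_unimodal`, `forward_mixture`, `MIXTURE_ARCS`, `OBS`) are hypothetical. Each arc with probability a and mixture weights c_k is replaced by parallel arcs with probabilities a * c_k, each carrying a single Gaussian. A unimodal forward pass over the expanded arcs then gives the same likelihood as a mixture forward pass.

```python
import math

def gauss(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def expand_mixture_arcs(mixture_arcs):
    """Replace each mixture arc by one unimodal arc per mixture component.

    A mixture arc is (src, dst, prob, [(weight, mean, var), ...]).
    A unimodal arc is (src, dst, prob * weight, mean, var).
    """
    arcs = []
    for src, dst, prob, components in mixture_arcs:
        for weight, mean, var in components:
            arcs.append((src, dst, prob * weight, mean, var))
    return arcs

def forward_unimodal(arcs, n_states, obs):
    """Forward pass for a transition-based HMM with one Gaussian per arc.

    Starts in state 0 and returns the probability mass in the last state.
    """
    alpha = [1.0] + [0.0] * (n_states - 1)
    for x in obs:
        new_alpha = [0.0] * n_states
        for src, dst, prob, mean, var in arcs:
            new_alpha[dst] += alpha[src] * prob * gauss(x, mean, var)
        alpha = new_alpha
    return alpha[-1]

def forward_mixture(mixture_arcs, n_states, obs):
    """Reference forward pass that evaluates the mixture density directly."""
    alpha = [1.0] + [0.0] * (n_states - 1)
    for x in obs:
        new_alpha = [0.0] * n_states
        for src, dst, prob, components in mixture_arcs:
            density = sum(w * gauss(x, m, v) for w, m, v in components)
            new_alpha[dst] += alpha[src] * prob * density
        alpha = new_alpha
    return alpha[-1]

# Toy left-to-right model (illustrative values only, not from the paper).
MIXTURE_ARCS = [
    (0, 0, 0.6, [(0.7, 0.0, 1.0), (0.3, 2.0, 0.5)]),
    (0, 1, 0.4, [(0.5, 1.0, 1.0), (0.5, 3.0, 2.0)]),
    (1, 1, 1.0, [(1.0, 2.5, 1.0)]),
]
OBS = [0.1, 1.8, 2.9]

if __name__ == "__main__":
    expanded = expand_mixture_arcs(MIXTURE_ARCS)
    print(len(expanded), "unimodal arcs")
    print(forward_mixture(MIXTURE_ARCS, 2, OBS))
    print(forward_unimodal(expanded, 2, OBS))
```

The two likelihoods agree up to floating-point rounding because summing over parallel arcs reproduces the weighted sum inside the mixture density. The sketch covers only likelihood evaluation, not Baum-Welch re-estimation.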
