Reduced Gaussian mixture models in a large vocabulary continuous speech recognizer

Large vocabulary continuous speech recognition (LVCSR) systems usually employ several tens of thousands of Gaussian mixture components for an accurate statistical representation of naturally spoken human speech. For applications that cannot afford the computationally expensive evaluation of numerous Gaussians at recognition time, an important question is whether the number of Gaussians can be significantly reduced without a large degradation in recognition accuracy. In this paper we introduce two new methods for pruning Gaussians in a continuous density HMM-based speech recognizer; they address either the contribution of a Gaussian to the observation likelihood of an HMM state or the reliability of parameter estimation during acoustic model training. Experimental results show that we can reduce the number of mixture components by more than 33 percent, while the speaker-independent word error rate shows a relative increase of only 2 percent.
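
The abstract does not state the two pruning criteria in detail, so the following is only a rough sketch of the general idea, not the method of the paper. It assumes the mixture weight as a stand-in for a Gaussian's contribution to the state likelihood, and the training occupancy count as a stand-in for the reliability of its parameter estimates. The function name, thresholds, and numbers are hypothetical.

```python
def prune_state_mixture(weights, counts, min_weight=0.0, min_count=0.0):
    """Prune the Gaussians of one HMM state (illustrative sketch only).

    weights: mixture weights of the state's Gaussians (assumed to sum to 1).
    counts:  training occupancy counts per Gaussian (assumed proxy for
             how reliably its parameters were estimated).
    Returns the indices of the kept Gaussians and their renormalized weights.
    """
    kept = [
        i
        for i, (w, c) in enumerate(zip(weights, counts))
        if w >= min_weight and c >= min_count
    ]
    if not kept:
        # Never leave a state without any Gaussian: keep the heaviest one.
        kept = [max(range(len(weights)), key=lambda i: weights[i])]
    total = sum(weights[i] for i in kept)
    # Renormalize so the remaining weights again sum to 1.
    return kept, [weights[i] / total for i in kept]


# Hypothetical state with four Gaussians.
weights = [0.5, 0.3, 0.15, 0.05]
counts = [500.0, 300.0, 150.0, 8.0]
kept, new_weights = prune_state_mixture(weights, counts, min_weight=0.1, min_count=10.0)
```

In this toy state the fourth Gaussian falls below both assumed thresholds and is dropped, and the remaining three weights are rescaled. In a real system the thresholds would have to be tuned against the word error rate on held-out data.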
