Large scale discriminative training of hidden Markov models for speech recognition

This paper describes, and evaluates on a large scale, the lattice based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). This paper concentrates on the maximum mutual information estimation (MMIE) criterion which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale application of discriminative training techniques for speech recognition of which the authors are aware. Details are given of the MMIE lattice-based implementation used with the extended Baum-Welch algorithm, which makes training of such large systems computationally feasible. Techniques for improving generalization using acoustic scaling and weakened language models are discussed. The overall technique has allowed the estimation of triphone and quinphone HMM parameters which has led to significant reductions in word error rate for the transcription of conversational telephone speech relative to our best systems trained using maximum likelihood estimation (MLE). This is in contrast to some previous studies, which have concluded that there is little benefit in using discriminative training for the most difficult large vocabulary speech recognition tasks. The lattice MMIE-based discriminative training scheme is also shown to out-perform the frame discrimination technique. Various properties of the lattice-based MMIE training scheme are investigated including comparisons of different lattice processing strategies (full search and exact-match) and the effect of lattice size on performance. Furthermore a scheme based on the linear interpolation of the MMIE and MLE objective functions is shown to reduce the danger of over-training. It is shown that HMMs trained with MMIE benefit as much as MLE-trained HMMs from applying model adaptation using maximum likelihood linear regression (MLLR). This has allowed the straightforward integration of MMIE-trained HMMs into complex multi-pass systems for transcription of conversational telephone speech and has contributed to our MMIE-trained systems giving the lowest word error rates in both the 2000 and 2001 NIST Hub5 evaluations.

[1]  A. Nadas,et al.  A decision theorectic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood , 1983 .

[2]  Michael Picheny,et al.  On a model-robust training method for speech recognition , 1988, IEEE Trans. Acoust. Speech Signal Process..

[3]  Harvey F. Silverman,et al.  Hidden Markov model/neural network training techniques for connected alphadigit speech recognition , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[4]  Dimitri Kanevsky,et al.  An inequality for rational functions with applications to some statistical estimation problems , 1991, IEEE Trans. Inf. Theory.

[5]  Salvatore D. Morgera,et al.  An improved MMIE training algorithm for speaker-independent, small vocabulary, continuous speech recognition , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[6]  Biing-Hwang Juang,et al.  Minimum error rate training based on N-best string models , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7]  Steve J. Young,et al.  MMI training for continuous phoneme recognition on the TIMIT database , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8]  S. J. Young,et al.  Tree-based state tying for high accuracy acoustic modelling , 1994 .

[9]  V. Valtchev,et al.  The effect of the metal substrate composition on the crystallization of zeolite coatings , 1995 .

[10]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[11]  Mark J. F. Gales,et al.  Mean and variance adaptation within the MLLR framework , 1996, Comput. Speech Lang..

[12]  S. Young,et al.  Lattice-based discriminative training for large vocabulary speech recognition , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[13]  Steve J. Young,et al.  MMIE training of large vocabulary recognition systems , 1997, Speech Communication.

[14]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[15]  F. Kapadia Unexpected endotracheal extubations , 1998, The Lancet.

[16]  Thomas Niesler,et al.  The 1998 HTK system for transcription of conversational telephone speech , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[17]  Daniel Povey,et al.  Frame discrimination training for HMMs for large vocabulary speech recognition , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).