Maximum accept and reject (MARS) training of HMM-GMM speech recognition systems

This paper describes a new discriminative HMM parameter estimation technique. It supplements the usual ML optimization function with the emission (accept) likelihood of the aligned state (phone) and the rejection likelihoods from the rest of the states (phones). Intuitively, this new optimization function takes into account how well the other states reject the current frame that has been aligned with a given state. This simple scheme, termed Maximum Accept and Reject (MARS), implicitly brings in discriminative information and hence performs better than ML-trained models. As is well known, maximum mutual information (MMI) training [3, 4] needs a language model (lattice) encoding all possible sentences that could occur in the test conditions [7, 9]. MMI training uses this language model (lattice) to identify the confusable segments of speech in the form of the so-called "denominator" state occupation statistics [7]. However, this implicitly ties the MMI-trained acoustic model to a particular task domain. MARS training does not face this constraint: it finds the confusable states at the frame level and hence does not use a language model (lattice) during training.
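The frame-level accept/reject idea can be illustrated with a small sketch. The abstract does not give the exact MARS objective, so everything here is an assumption made for illustration: each state is a single 1-D Gaussian rather than a GMM, state priors are uniform, the "accept" term is the log posterior of the aligned state, and the "rejection" term for every other state is the log of one minus its posterior. The function names (`gaussian_logpdf`, `state_posteriors`, `accept_reject_score`) are hypothetical and this is not the paper's formulation.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian (stand-in for a state's GMM emission)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def state_posteriors(x, states):
    """Posterior of each state given frame x, assuming uniform state priors."""
    log_liks = [gaussian_logpdf(x, mean, var) for mean, var in states]
    top = max(log_liks)  # subtract the max for numerical stability
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]


def accept_reject_score(x, aligned, states, eps=1e-12):
    """Assumed frame score: log accept by the aligned state plus
    log rejection (1 - posterior) by every other state."""
    post = state_posteriors(x, states)
    accept = math.log(max(post[aligned], eps))
    reject = sum(
        math.log(max(1.0 - p, eps))
        for j, p in enumerate(post)
        if j != aligned
    )
    return accept + reject


# Three hypothetical states given as (mean, variance) pairs.
states = [(0.0, 1.0), (3.0, 1.0), (6.0, 1.0)]

# A frame close to state 0, and a frame halfway between states 0 and 1.
near = accept_reject_score(0.1, 0, states)
confusable = accept_reject_score(1.5, 0, states)
print(near, confusable)
```

Under these assumptions, a frame that a competing state fails to reject (the halfway frame) scores lower than a frame that only the aligned state explains well. No lattice or language model is involved, since the competitors are found per frame.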

[1]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[2]  R. Okafor Maximum likelihood estimation from incomplete data , 1987 .

[3]  Yves Normandin Optimal splitting of HMM Gaussian mixture components with MMIE training , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[4]  Steve J. Young,et al.  MMI training for continuous phoneme recognition on the TIMIT database , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Daniel Povey,et al.  Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  Hervé Bourlard,et al.  An introduction to the hybrid hmm/connectionist approach , 1995 .

[7]  Dimitri Kanevsky,et al.  An inequality for rational functions with applications to some statistical estimation problems , 1991, IEEE Trans. Inf. Theory.

[8]  Daniel Povey,et al.  Frame discrimination training for HMMs for large vocabulary speech recognition , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[9]  Tara N. Sainath,et al.  Broad phonetic class recognition in a Hidden Markov model framework using extended Baum-Welch transformations , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[10]  A. Nadas,et al.  A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood , 1983 .

[11]  Steve J. Young,et al.  MMIE training of large vocabulary recognition systems , 1997, Speech Communication.