Investigations on an EM-Style Optimization Algorithm for Discriminative Training of HMMs

Today's speech recognition systems are based on hidden Markov models (HMMs) with Gaussian mixture models whose parameters are estimated using a discriminative training criterion such as Maximum Mutual Information (MMI) or Minimum Phone Error (MPE). Currently, the optimization is almost always done with (empirical variants of) Extended Baum-Welch (EBW). This type of optimization requires sophisticated update schemes for the step sizes and a considerable amount of parameter tuning, and little is known about its convergence behavior. In this paper, we derive an EM-style algorithm for discriminative training of HMMs. Like Expectation-Maximization (EM) for the generative training of HMMs, the proposed algorithm improves the training criterion on each iteration, converges to a local optimum, and is completely parameter-free. We investigate the feasibility of the proposed EM-style algorithm for discriminative training on two tasks, namely grapheme-to-phoneme conversion and spoken digit string recognition.
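The properties attributed to generative EM above (no step size to tune, and a training criterion that never gets worse from one iteration to the next) can be illustrated on a toy problem. The following is a minimal sketch of standard maximum-likelihood EM for a one-dimensional Gaussian mixture, not the discriminative algorithm derived in the paper. The function names, the initialization, and the synthetic data are assumptions made purely for illustration.

```python
import math
import random


def gauss(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Generative training criterion: total log-likelihood of the data."""
    return sum(
        math.log(sum(w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances):
    """One EM iteration. Note that there is no learning rate or step size."""
    # E-step: posterior probability of each component for each observation.
    posteriors = []
    for x in data:
        joint = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])

    # M-step: closed-form re-estimation from the posterior-weighted statistics.
    new_weights, new_means, new_variances = [], [], []
    for k in range(len(weights)):
        occupancy = sum(p[k] for p in posteriors)
        mean = sum(p[k] * x for p, x in zip(posteriors, data)) / occupancy
        var = sum(p[k] * (x - mean) ** 2 for p, x in zip(posteriors, data)) / occupancy
        new_weights.append(occupancy / len(data))
        new_means.append(mean)
        new_variances.append(var)
    return new_weights, new_means, new_variances


def run_em(data, iterations):
    """Run EM on a two-component mixture; return the log-likelihood history."""
    # Hypothetical initialization chosen for this sketch only.
    weights, means, variances = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(iterations):
        weights, means, variances = em_step(data, weights, means, variances)
        history.append(log_likelihood(data, weights, means, variances))
    return history


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data from two well-separated Gaussians (an assumption).
    data = [random.gauss(-2.0, 1.0) for _ in range(100)]
    data += [random.gauss(3.0, 1.0) for _ in range(100)]
    for i, value in enumerate(run_em(data, 10)):
        print(i, value)
```

Running the script prints a log-likelihood sequence that should be non-decreasing up to floating-point rounding. This is the behavior that the abstract contrasts with EBW-style updates, which depend on tuned step sizes.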