Margin-space integration of MPE loss via differencing of MMI functionals for generalized error-weighted discriminative training

Abstract

Using the central observation that margin-based weighted classification error (modeled using Minimum Phone Error (MPE)) corresponds to the derivative with respect to the margin term of margin-based hinge loss (modeled using Maximum Mutual Information (MMI)), this article subsumes and extends margin-based MPE and MMI within a broader framework in which the objective function is an integral of MPE loss over a range of margin values. Applying the Fundamental Theorem of Calculus, this integral is easily evaluated using finite differences of MMI functionals; lattice-based training using the new criterion can then be carried out using differences of MMI gradients. Experimental results comparing the new framework with margin-based MMI, MCE and MPE on the Corpus of Spontaneous Japanese and the MIT OpenCourseWare/MIT-World corpus are presented.

1. Introduction

The field of discriminative training for speech recognition has witnessed considerable activity in recent years. The appeal of minimizing phone or word error rather than string error has motivated a transition from well-known string-level methods such as MMI and MCE [1][2] to error-weighted approaches, such as MPE [3][4]. More recently, there has been a surge in proposals for “large margin” approaches to hidden Markov model (HMM) design, such as the “large-margin HMM” [5], “soft margin estimation” [6], and incrementally shifted MCE loss [7]. Sha and Saul [8] made the important proposal that a fine-grained error measure, such as the Hamming distance between candidate recognition strings, be itself directly incorporated into the margin term for HMM-based learning. It turns out that introducing a margin term that multiplies fine-grained error can easily be brought to MMI, MCE and MPE based HMM training as well, simply by adding margin-scaled local frame/phone/word error to lattice arc log-likelihoods during Forward-Backward computation [9][10][11]. This approach links the original use of margin in the context of machine learning (e.g. Support Vector Machines (SVMs)) with margin in the context of “tried-and-tested” frameworks for large-scale discriminative training with well-understood methods for HMM optimization on large-scale ASR tasks. Benefits to performance for large-scale tasks have been reported for the use of margin in MMI and MPE, though it appears the relative gains are larger for MMI than for MPE [10][11].

Aiming at leveraging the benefits of margin use within the context of MPE-style error-weighted HMM training, this article presents a unification of margin-based MMI and MPE training based on a novel concept:
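
The relation stated in the abstract (margin-based MPE loss as the derivative, with respect to the margin, of a margin-based MMI functional) can be checked numerically on a toy example. The following is a minimal sketch under stated assumptions, not the article's lattice-based implementation: it uses a small N-best list instead of a lattice, takes the MMI functional to be the log-sum-exp of margin-boosted hypothesis scores, and takes the MPE loss to be the expected error under the margin-boosted posterior. The function names (`mmi_functional`, `mpe_loss`, `integrated_mpe_numeric`, `integrated_mpe_by_differencing`), the sign convention for the margin, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def mmi_functional(scores, errors, sigma):
    """Toy margin-based MMI term: log-sum-exp of scores boosted by sigma * error.

    Assumption: each hypothesis i has a log score scores[i] and an error count
    errors[i] relative to the reference; the margin sigma scales the error.
    """
    boosted = [s + sigma * e for s, e in zip(scores, errors)]
    m = max(boosted)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(b - m) for b in boosted))


def mpe_loss(scores, errors, sigma):
    """Toy margin-based MPE loss: expected error under the boosted posterior."""
    boosted = [s + sigma * e for s, e in zip(scores, errors)]
    m = max(boosted)
    weights = [math.exp(b - m) for b in boosted]
    z = sum(weights)
    return sum(w * e for w, e in zip(weights, errors)) / z


def integrated_mpe_numeric(scores, errors, lo, hi, steps=2000):
    """Midpoint-rule integral of the MPE loss over the margin interval [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(
        mpe_loss(scores, errors, lo + (k + 0.5) * h) for k in range(steps)
    )


def integrated_mpe_by_differencing(scores, errors, lo, hi):
    """Same integral via the Fundamental Theorem of Calculus: F(hi) - F(lo)."""
    return mmi_functional(scores, errors, hi) - mmi_functional(scores, errors, lo)


# Hypothetical 4-best list: log scores and error counts (first entry is correct).
scores = [-10.0, -11.5, -12.0, -9.5]
errors = [0.0, 1.0, 2.0, 3.0]

numeric = integrated_mpe_numeric(scores, errors, -1.0, 2.0)
closed = integrated_mpe_by_differencing(scores, errors, -1.0, 2.0)
print("numeric integral of MPE loss:", numeric)
print("difference of MMI functionals:", closed)
```

In this toy setting the two printed values agree up to the quadrature error of the midpoint rule, which illustrates why an integral of MPE loss over a margin range needs only two evaluations of an MMI-style functional.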

[1] Hermann Ney, et al. Investigations on error minimizing training criteria for discriminative training in automatic speech recognition, 2005, INTERSPEECH.

[2] Shigeru Katagiri, et al. A unified view for discriminative objective functions based on negative exponential of difference measure between strings, 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[3] Hung-An Chang, et al. Discriminative training of hierarchical acoustic models for large vocabulary continuous speech recognition, 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4] Dong Yu, et al. Large-margin minimum classification error training: A theoretical risk minimization perspective, 2008, Comput. Speech Lang.

[5] Hui Jiang, et al. Large margin HMMs for speech recognition, 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[6] Daniel Povey, et al. Minimum Phone Error and I-smoothing for improved discriminative training, 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7] Brian Kingsbury, et al. Boosted MMI for model and feature-space discriminative training, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8] Atsushi Nakamura, et al. String and lattice based discriminative training for the corpus of spontaneous Japanese lecture transcription task, 2007, INTERSPEECH.

[9] Georg Heigold, et al. Modified MMI/MPE: a direct evaluation of the margin in speech recognition, 2008, ICML '08.

[10] Jinyu Li, et al. A study on soft margin estimation for LVCSR, 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[11] Dwi Sianto Mansjur, et al. Non-Uniform error criteria for automatic pattern and speech recognition, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12] Jonathan Le Roux, et al. Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[13] Lawrence K. Saul, et al. Large Margin Hidden Markov Models for Automatic Speech Recognition, 2006, NIPS.

[14] George Saon, et al. Penalty function maximization for large margin HMM training, 2008, INTERSPEECH.