Discriminative learning in sequential pattern recognition

In this article, we studied the objective functions of MMI, MCE, and MPE/MWE for discriminative learning in sequential pattern recognition. We presented an approach that unifies these objective functions in the common rational-function form of (25), and we derived and studied the exact structure of that form for each discriminative criterion. While the rational-function form of MMI was already known, we provided a theoretical proof that a similar rational-function form exists for the objective functions of MCE and MPE/MWE. Moreover, we showed that the rational-function forms for MMI, MCE, and MPE/MWE differ in the constant weighting factors C_DT(s_1 ... s_R), and that these weighting factors depend only on the labeled sequence s_1 ... s_R and are independent of the parameter set Λ to be optimized. The derived rational-function form allows the GT/EBW-based parameter optimization framework to be applied directly in discriminative learning. In the past, the lack of an appropriate rational-function form was a difficulty for MCE and MPE/MWE, because without it the GT/EBW-based framework cannot be applied directly. Based on the unified rational-function form, we derived, in a tutorial style, the GT/EBW-based parameter optimization formulas for both discrete HMMs and CDHMMs in discriminative learning using the MMI, MCE, and MPE/MWE criteria.
The unifying review provided in this article builds on a large number of earlier contributions, which have been cited and discussed throughout the article. Here we provide a brief summary of such background work. Extension to large-scale speech recognition tasks was accomplished in the work of [59] and [60]. The dissertation of [47] further improved the MMI criterion to that of MPE/MWE.
In a parallel vein, the work of [20] provided an alternative approach to that of [41], attempting to derive more rigorously a CDHMM model re-estimation formula that gives positive growth of the MMI objective function. A crucial error in this attempt was corrected in [2], establishing an existence proof of such positive growth. The main goal of this article is to provide an underlying foundation for MMI, MCE, and MPE/MWE at the objective-function level, to facilitate the development of new parameter optimization techniques and the incorporation of other pattern recognition concepts, e.g., discriminative margins [66], into the current discriminative learning paradigm.
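To make the GT/EBW idea concrete, the following is a minimal Python sketch of the commonly cited extended Baum-Welch update for a discrete emission distribution under the MMI criterion. It is an illustration under stated assumptions, not the article's derivation: each "sequence" is a single symbol, class priors are equal, the constant D is fixed by hand, and the names `posteriors`, `mmi_objective`, `ebw_step`, and `train` are hypothetical. The update sets each probability proportional to (numerator count - denominator count + D * old probability) and then renormalizes.

```python
import math


def posteriors(b, x):
    """Class posteriors for symbol x, assuming equal class priors."""
    total = sum(b[c][x] for c in b)
    return {c: b[c][x] / total for c in b}


def mmi_objective(b, data):
    """Sum of log posteriors of the correct class over the training data."""
    return sum(math.log(posteriors(b, x)[s]) for x, s in data)


def ebw_step(b, data, D):
    """One EBW update of every class-conditional discrete distribution."""
    new_b = {}
    for c in b:
        num_symbols = len(b[c])
        # Numerator (correct-label) count minus denominator (posterior) count.
        diff = [0.0] * num_symbols
        for x, s in data:
            gamma = posteriors(b, x)[c]
            diff[x] += (1.0 if s == c else 0.0) - gamma
        unnorm = [diff[k] + D * b[c][k] for k in range(num_symbols)]
        if min(unnorm) <= 0.0:
            # Assumption of this sketch: the caller picks D large enough.
            raise ValueError("D is too small to keep probabilities positive")
        z = sum(unnorm)
        new_b[c] = [u / z for u in unnorm]
    return new_b


def train(b, data, D, iterations):
    """Run several EBW steps and record the objective after each one."""
    objectives = [mmi_objective(b, data)]
    for _ in range(iterations):
        b = ebw_step(b, data, D)
        objectives.append(mmi_objective(b, data))
    return b, objectives


if __name__ == "__main__":
    # Toy data (hypothetical): pairs of (observed symbol, class label).
    toy_data = [(0, "A")] * 3 + [(1, "A")] + [(0, "B")] + [(1, "B")] * 3
    init = {"A": [0.5, 0.5], "B": [0.5, 0.5]}
    final_b, objs = train(init, toy_data, D=10.0, iterations=20)
    print(final_b)
    print(objs[0], objs[-1])
```

In this sketch D trades step size against stability: a larger D keeps every updated probability positive and makes each step more conservative, while a smaller D moves faster but can fail the positivity check. A practical system would choose D per distribution rather than use one hand-set constant.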

[1] Shun-ichi Amari, et al. A Theory of Adaptive Pattern Classifiers, 1967, IEEE Trans. Electron. Comput.

[2] L. Baum, et al. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology, 1967.

[3] B. Ripley, et al. Pattern Recognition, 1968, Nature.

[4] L. Baum, et al. Growth transformations for functions on manifolds, 1968.

[5] Volume Assp, et al. ACOUSTICS. SPEECH. AND SIGNAL PROCESSING, 1983.

[6] A. Nadas, et al. A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood, 1983.

[7] John E. Dennis, et al. Numerical methods for unconstrained optimization and nonlinear equations, 1983, Prentice Hall series in computational mathematics.

[8] Peter F. Brown, et al. The acoustic-modeling problem in automatic speech recognition, 1987.

[9] Michael Picheny, et al. On a model-robust training method for speech recognition, 1988, IEEE Trans. Acoust. Speech Signal Process.

[10] Scott E. Fahlman, et al. An empirical study of learning speed in back-propagation networks, 1988.

[11] Dimitri Kanevsky, et al. An inequality for rational functions with applications to some statistical estimation problems, 1991, IEEE Trans. Inf. Theory.

[12] Yves Normandin, et al. Hidden Markov models, maximum mutual information estimation, and the speech recognition problem, 1992.

[13] Roberto Battiti, et al. First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method, 1992, Neural Computation.

[14] Biing-Hwang Juang, et al. Discriminative learning for minimum error classification [pattern recognition], 1992, IEEE Trans. Signal Process.

[15] Martin A. Riedmiller, et al. A direct adaptive method for faster backpropagation learning: the RPROP algorithm, 1993, IEEE International Conference on Neural Networks.

[16] Biing-Hwang Juang, et al. Fundamentals of speech recognition, 1993, Prentice Hall signal processing series.

[17] Steve J. Young, et al. MMI training for continuous phoneme recognition on the TIMIT database, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[18] Yangsheng Xu, et al. Hidden Markov model approach to skill learning and its application to telerobotics, 1993, IEEE Trans. Robotics Autom.

[19] Dimitri Kanevsky. A generalization of the Baum algorithm to functions on non-linear manifolds, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[20] Yves Normandin. Maximum Mutual Information Estimation of Hidden Markov Models, 1996.

[21] S. Young, et al. Lattice-based discriminative training for large vocabulary speech recognition, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[22] Biing-Hwang Juang, et al. Minimum classification error rate methods for speech recognition, 1997, IEEE Trans. Speech Audio Process.

[23] Steve J. Young, et al. MMIE training of large vocabulary recognition systems, 1997, Speech Communication.

[24] Li Deng, et al. Speech trajectory discrimination using the minimum classification error learning, 1998, IEEE Trans. Speech Audio Process.

[25] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[26] Sean R. Eddy, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1998.

[27] Chin-Hui Lee, et al. Minimum error rate training for PHMM-based text recognition, 1999, IEEE Trans. Image Process.

[28] Ralf Schlüter, et al. Investigations on discriminative training criteria, 2000.

[29] Alex Pentland, et al. On Reversing Jensen's Inequality, 2000, NIPS.

[30] Daniel Povey, et al. Large scale discriminative training for speech recognition, 2000.

[31] Andrew McCallum, et al. Maximum Entropy Markov Models for Information Extraction and Segmentation, 2000, ICML.

[32] Qiang Huo, et al. A Discrete Contextual Stochastic Model for the Offline Recognition of Handwritten Chinese Characters, 2001, IEEE Trans. Pattern Anal. Mach. Intell.

[33] William J. Byrne, et al. Discriminative speaker adaptation with conditional maximum likelihood linear regression, 2001, INTERSPEECH.

[34] Ewan Birney, et al. Hidden Markov models in biological sequence analysis, 2001, IBM J. Res. Dev.

[35] Alex Acero, et al. Spoken Language Processing, 2001.

[36] Hermann Ney, et al. Comparison of discriminative training criteria and optimization methods for speech recognition, 2001, Speech Commun.

[37] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[38] Daniel Povey, et al. Minimum Phone Error and I-smoothing for improved discriminative training, 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[39] Alex Pentland, et al. Discriminative, generative and imitative learning, 2002.

[40] Daniel Povey, et al. Large scale discriminative training of hidden Markov models for speech recognition, 2002, Comput. Speech Lang.

[41] Michael Collins, et al. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, 2002, EMNLP.

[42] Mark J. F. Gales, et al. MMI-MAP and MPE-MAP for acoustic model adaptation, 2003, INTERSPEECH.

[43] Franz Josef Och, et al. Minimum Error Rate Training in Statistical Machine Translation, 2003, ACL.

[44] Wu Chou, et al. Minimum classification error linear regression for acoustic model adaptation of continuous density HMMs, 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).

[45] Wu Chou. Minimum Classification Error (MCE) Approach in Pattern Recognition, 2003.

[46] Wu Chou, et al. A minimum classification error (MCE) framework for generalized linear classifier in machine learning for text categorization/retrieval, 2004, 2004 International Conference on Machine Learning and Applications, 2004. Proceedings.

[47] Erik McDermott, et al. Minimum classification error training of landmark models for real-time continuous speech recognition, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[48] Jie Yang, et al. A discriminative learning framework with pairwise constraints for video object classification, 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004.

[49] Li Deng, et al. Analysis and comparison of two speech feature extraction/compensation algorithms, 2005, IEEE Signal Processing Letters.

[50] Geoffrey Zweig, et al. fMPE: discriminatively trained features for speech recognition, 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[51] Hermann Ney, et al. Investigations on error minimizing training criteria for discriminative training in automatic speech recognition, 2005, INTERSPEECH.

[52] Yi Li, et al. A generative/discriminative learning algorithm for image classification, 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[53] Fernando Pereira. Linear models for structure prediction, 2005, INTERSPEECH.

[54] Dong Yu, et al. A Generative Modeling Framework for Structured Hidden Speech Dynamics, 2005.

[55] Alex Acero, et al. Hidden conditional random fields for phone classification, 2005, INTERSPEECH.

[56] Alex Acero, et al. Joint Discriminative Front End and Back End Training for Improved Speech Recognition Accuracy, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[57] Yuqing Gao, et al. Maximum entropy direct models for speech recognition, 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[58] Wu Chou, et al. A Novel Learning Method for Hidden Markov Models in Speech and Audio Processing, 2006, 2006 IEEE Workshop on Multimedia Signal Processing.

[59] Scott Axelrod, et al. Discriminative Estimation of Subspace Constrained Gaussian Mixture Models for Speech Recognition, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[60] Dong Yu, et al. Large-Margin Minimum Classification Error Training for Large-Scale Speech Recognition Tasks, 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[61] Jonathan Le Roux, et al. Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[62] Radford M. Neal, et al. Haplotype inference using a Bayesian Hidden Markov model, 2007, Genetic epidemiology.

[63] Xiaodong He, et al. Discriminative Learning for Speech Recognition: Theory and Practice, 2008, Discriminative Learning for Speech Recognition.

[64] Frank K. Soong, et al. A Constrained Line Search Optimization Method for Discriminative Training of HMMs, 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[65] S. Katagiri, et al. Discriminative Learning for Minimum Error Classification, 2009.