Paper information - Discriminative learning in sequential pattern recognition

Discriminative learning in sequential pattern recognition

In this article, we studied the objective functions of MMI, MCE, and MPE/MWE for discriminative learning in sequential pattern recognition. We presented an approach that unifies these objective functions in the common rational-function form of (25), and we derived and studied the exact structure of that form for each discriminative criterion. While the rational-function form of MMI was already known, we provided a theoretical proof that a similar rational-function form exists for the objective functions of MCE and MPE/MWE. Moreover, we showed that the rational-function forms for MMI, MCE, and MPE/MWE differ in the constant weighting factors C_DT(s_1 ... s_R), and that these weighting factors depend only on the labeled sequence s_1 ... s_R and are independent of the parameter set Λ to be optimized. The derived rational-function form allows the GT/EBW-based parameter optimization framework to be applied directly in discriminative learning. In the past, the lack of an appropriate rational-function form was a difficulty for MCE and MPE/MWE, because without it the GT/EBW-based framework cannot be applied directly. Based on the unified rational-function form, we derived, in a tutorial style, the GT/EBW-based parameter optimization formulas for both discrete HMMs and CDHMMs in discriminative learning using the MMI, MCE, and MPE/MWE criteria.
The unifying review provided in this article builds on a large number of earlier contributions that have been cited and discussed throughout the article. Here we provide a brief summary of such background work. Extension to large-scale speech recognition tasks was accomplished in the work of [59] and [60]. The dissertation of [47] further improved the MMI criterion to that of MPE/MWE.
In a parallel vein, the work of [20] provided an alternative approach to that of [41], attempting to derive more rigorously a CDHMM re-estimation formula that gives positive growth of the MMI objective function. A crucial error in this attempt was corrected in [2], which established an existence proof of such positive growth. The main goal of this article is to provide an underlying foundation for MMI, MCE, and MPE/MWE at the objective function level to facilitate the development of new parameter optimization techniques and to incorporate other pattern recognition concepts, e.g., discriminative margins [66], into the current discriminative learning paradigm.
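To make the growth-transformation idea concrete, the following is a minimal Python sketch of a GT/EBW-style update on a toy problem, not the article's HMM formulas. It assumes a single categorical distribution `p` over three labels, fixed likelihoods `lik`, and label-dependent constant weights `c` (standing in for the weighting factors that do not depend on the parameters), so that the objective is a ratio of two sums. The names `objective`, `gradient`, `ebw_step`, `run_ebw`, and the heuristic for the smoothing constant `d` are all illustrative assumptions.

```python
def objective(p, lik, c):
    """Toy rational objective: sum_s p[s]*lik[s]*c[s] / sum_s p[s]*lik[s]."""
    num = sum(ps * ls * cs for ps, ls, cs in zip(p, lik, c))
    den = sum(ps * ls for ps, ls in zip(p, lik))
    return num / den


def gradient(p, lik, c):
    """Partial derivatives of the toy objective with respect to each p[s]."""
    num = sum(ps * ls * cs for ps, ls, cs in zip(p, lik, c))
    den = sum(ps * ls for ps, ls in zip(p, lik))
    return [(ls * cs * den - ls * num) / den ** 2 for ls, cs in zip(lik, c)]


def ebw_step(p, grad, d):
    """Growth-transformation update: p[s] <- p[s]*(grad[s] + d) / normaliser."""
    scaled = [ps * (gs + d) for ps, gs in zip(p, grad)]
    total = sum(scaled)
    return [x / total for x in scaled]


def run_ebw(p, lik, c, n_iters=20, damping=2.0):
    """Iterate the update and record the objective after every step."""
    history = [objective(p, lik, c)]
    for _ in range(n_iters):
        grad = gradient(p, lik, c)
        # Assumed heuristic: choose d larger than any |gradient| entry so
        # that every updated probability stays strictly positive.
        d = max(damping * max(abs(g) for g in grad), 1e-8)
        p = ebw_step(p, grad, d)
        history.append(objective(p, lik, c))
    return p, history


if __name__ == "__main__":
    p0 = [0.5, 0.3, 0.2]    # initial categorical parameters (assumed)
    lik = [0.2, 0.5, 0.3]   # fixed per-label likelihoods (assumed)
    c = [1.0, 0.0, 0.0]     # weight 1 for the correct label, 0 otherwise
    p_final, hist = run_ebw(p0, lik, c)
    print("objective before:", hist[0])
    print("objective after: ", hist[-1])
    print("final parameters:", p_final)
```

In this toy setting the objective is a ratio of two linear functions of `p`, so any `d` that keeps the updated probabilities positive yields a non-decreasing objective. Larger values of `d` give smaller, more conservative steps, while smaller values move faster; for the HMM objectives discussed in the article the choice of this constant is more delicate.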

[1] Shun-ichi Amari,et al. A Theory of Adaptive Pattern Classifiers , 1967, IEEE Trans. Electron. Comput..

[2] L. Baum,et al. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology , 1967 .

[3] B. Ripley,et al. Pattern Recognition , 1968, Nature.

[4] L. Baum,et al. Growth transformations for functions on manifolds. , 1968 .

[5] Volume Assp,et al. ACOUSTICS. SPEECH. AND SIGNAL PROCESSING , 1983 .

[6] A. Nadas,et al. A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood , 1983 .

[7] John E. Dennis,et al. Numerical methods for unconstrained optimization and nonlinear equations , 1983, Prentice Hall series in computational mathematics.

[8] Peter F. Brown,et al. The acoustic-modeling problem in automatic speech recognition , 1987 .

[9] Michael Picheny,et al. On a model-robust training method for speech recognition , 1988, IEEE Trans. Acoust. Speech Signal Process..

[10] Scott E. Fahlman,et al. An empirical study of learning speed in back-propagation networks , 1988 .

[11] Dimitri Kanevsky,et al. An inequality for rational functions with applications to some statistical estimation problems , 1991, IEEE Trans. Inf. Theory.

[12] Yves Normandin,et al. Hidden Markov models, maximum mutual information estimation, and the speech recognition problem , 1992 .

[13] Roberto Battiti,et al. First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method , 1992, Neural Computation.

[14] Biing-Hwang Juang,et al. Discriminative learning for minimum error classification [pattern recognition] , 1992, IEEE Trans. Signal Process..

[15] Martin A. Riedmiller,et al. A direct adaptive method for faster backpropagation learning: the RPROP algorithm , 1993, IEEE International Conference on Neural Networks.

[16] Biing-Hwang Juang,et al. Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[17] Steve J. Young,et al. MMI training for continuous phoneme recognition on the TIMIT database , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[18] Yangsheng Xu,et al. Hidden Markov model approach to skill learning and its application to telerobotics , 1993, IEEE Trans. Robotics Autom..

[19] Dimitri Kanevsky. A generalization of the Baum algorithm to functions on non-linear manifolds , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[20] Yves Normandin. Maximum Mutual Information Estimation of Hidden Markov Models , 1996 .

[21] S. Young,et al. Lattice-based discriminative training for large vocabulary speech recognition , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[22] Biing-Hwang Juang,et al. Minimum classification error rate methods for speech recognition , 1997, IEEE Trans. Speech Audio Process..

[23] Steve J. Young,et al. MMIE training of large vocabulary recognition systems , 1997, Speech Communication.

[24] Li Deng,et al. Speech trajectory discrimination using the minimum classification error learning , 1998, IEEE Trans. Speech Audio Process..

[25] Vladimir Vapnik,et al. Statistical learning theory , 1998 .

[26] Sean R. Eddy,et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[27] Chin-Hui Lee,et al. Minimum error rate training for PHMM-based text recognition , 1999, IEEE Trans. Image Process..

[28] Ralf Schlüter,et al. Investigations on discriminative training criteria , 2000 .

[29] Alex Pentland,et al. On Reversing Jensen's Inequality , 2000, NIPS.

[30] Daniel Povey,et al. Large scale discriminative training for speech recognition , 2000 .

[31] Andrew McCallum,et al. Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[32] Qiang Huo,et al. A Discrete Contextual Stochastic Model for the Offline Recognition of Handwritten Chinese Characters , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[33] William J. Byrne,et al. Discriminative speaker adaptation with conditional maximum likelihood linear regression , 2001, INTERSPEECH.

[34] Ewan Birney,et al. Hidden Markov models in biological sequence analysis , 2001, IBM J. Res. Dev..

[35] Alex Acero,et al. Spoken Language Processing , 2001 .

[36] Hermann Ney,et al. Comparison of discriminative training criteria and optimization methods for speech recognition , 2001, Speech Commun..

[37] Andrew McCallum,et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[38] Daniel Povey,et al. Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[39] Alex Pentland,et al. Discriminative, generative and imitative learning , 2002 .

[40] Daniel Povey,et al. Large scale discriminative training of hidden Markov models for speech recognition , 2002, Comput. Speech Lang..

[41] Michael Collins,et al. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[42] Mark J. F. Gales,et al. MMI-MAP and MPE-MAP for acoustic model adaptation , 2003, INTERSPEECH.

[43] Franz Josef Och,et al. Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[44] Wu Chou,et al. Minimum classification error linear regression for acoustic model adaptation of continuous density HMMs , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[45] Wu Chou. Minimum Classification Error (MCE) Approach in Pattern Recognition , 2003 .

[46] Wu Chou,et al. A minimum classification error (MCE) framework for generalized linear classifier in machine learning for text categorization/retrieval , 2004, 2004 International Conference on Machine Learning and Applications, 2004. Proceedings..

[47] Erik McDermott,et al. Minimum classification error training of landmark models for real-time continuous speech recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[48] Jie Yang,et al. A discriminative learning framework with pairwise constraints for video object classification , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[49] Li Deng,et al. Analysis and comparison of two speech feature extraction/compensation algorithms , 2005, IEEE Signal Processing Letters.

[50] Geoffrey Zweig,et al. fMPE: discriminatively trained features for speech recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[51] Hermann Ney,et al. Investigations on error minimizing training criteria for discriminative training in automatic speech recognition , 2005, INTERSPEECH.

[52] Yi Li,et al. A generative/discriminative learning algorithm for image classification , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[53] Fernando Pereira. Linear models for structure prediction , 2005, INTERSPEECH.

[54] Dong Yu,et al. A Generative Modeling Framework for Structured Hidden Speech Dynamics , 2005 .

[55] Alex Acero,et al. Hidden conditional random fields for phone classification , 2005, INTERSPEECH.

[56] Alex Acero,et al. Joint Discriminative Front End and Back End Training for Improved Speech Recognition Accuracy , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[57] Yuqing Gao,et al. Maximum entropy direct models for speech recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[58] Wu Chou,et al. A Novel Learning Method for Hidden Markov Models in Speech and Audio Processing , 2006, 2006 IEEE Workshop on Multimedia Signal Processing.

[59] Scott Axelrod,et al. Discriminative Estimation of Subspace Constrained Gaussian Mixture Models for Speech Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[60] Dong Yu,et al. Large-Margin Minimum Classification Error Training for Large-Scale Speech Recognition Tasks , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[61] Jonathan Le Roux,et al. Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[62] Radford M. Neal,et al. Haplotype inference using a Bayesian Hidden Markov model , 2007, Genetic epidemiology.

[63] Xiaodong He,et al. Discriminative Learning for Speech Recognition: Theory and Practice , 2008, Discriminative Learning for Speech Recognition.

[64] Frank K. Soong,et al. A Constrained Line Search Optimization Method for Discriminative Training of HMMs , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[65] S. Katagiri,et al. Discriminative Learning for Minimum Error Classification , 2009 .