Discriminative Learning for Speech Recognition: Theory and Practice

In this book, we introduce the background and mainstream methods of probabilistic modeling and discriminative parameter optimization for speech recognition. The models treated in depth are the widely used exponential-family distributions and the hidden Markov model (HMM). A detailed study unifies the common objective functions for discriminative learning in speech recognition, namely maximum mutual information (MMI), minimum classification error (MCE), and minimum phone/word error (MPE/MWE). The unification is presented, with rigorous mathematical analysis, in a common rational-function form. This common form enables the use of the growth transformation (or extended Baum–Welch) optimization framework in discriminative learning of model parameters. In addition to the necessary background and tutorial material, we include technical details on the derivation of the parameter optimization formulas for exponential-family distributions, discrete HMMs, and continuous-density HMMs in discriminative learning. Selected experimental results obtained firsthand by the authors show that discriminative learning can lead to superior speech recognition performance over conventional parameter learning. Details on major algorithmic implementation issues of practical significance are provided so that practitioners can translate the theory in the earlier part of the book directly into engineering practice.

Table of Contents: Introduction and Background / Statistical Speech Recognition: A Tutorial / Discriminative Learning: A Unified Objective Function / Discriminative Learning Algorithm for Exponential-Family Distributions / Discriminative Learning Algorithm for Hidden Markov Model / Practical Implementation of Discriminative Learning / Selected Experimental Results / Epilogue / Major Symbols Used in the Book and Their Descriptions / Mathematical Notation / Bibliography
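As a rough illustration of the growth transformation mentioned above, the following minimal sketch applies one update to a discrete probability vector. It is not taken from the book: the function names, the toy objective, and the constant `d` are assumptions made for illustration only. In the generic form of the technique, each parameter is multiplied by its partial derivative plus a constant and the result is renormalized; for rational objectives the constant is usually chosen large enough to keep every term positive.

```python
def growth_step(theta, grad, d=0.0):
    """One growth-transformation update on a probability vector.

    theta: current probabilities (non-negative, summing to 1)
    grad:  partial derivatives of the objective at theta
    d:     hypothetical smoothing constant; larger values give smaller steps
    """
    # Scale each parameter by (gradient + d), then renormalize to sum to 1.
    weighted = [t * (g + d) for t, g in zip(theta, grad)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("increase d so that every grad + d is positive")
    return [w / total for w in weighted]


def toy_objective(theta):
    # Assumed toy objective: likelihood of seeing symbol 0 twice and symbol 1 once.
    return theta[0] ** 2 * theta[1]


def toy_gradient(theta):
    # Analytic partial derivatives of toy_objective.
    return [2 * theta[0] * theta[1], theta[0] ** 2]


theta = [0.5, 0.5]
updated = growth_step(theta, toy_gradient(theta))
print(updated, toy_objective(theta), toy_objective(updated))
```

With `d=0` and this toy likelihood, a single step lands on the relative-frequency estimate, which is the familiar Baum–Welch behavior for discrete distributions. A positive `d` would move the parameters only part of the way toward that estimate.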
