Large margin training of acoustic models for speech recognition

Automatic speech recognition (ASR) depends critically on building acoustic models for linguistic units. These acoustic models usually take the form of continuous-density hidden Markov models (CD-HMMs), whose parameters are obtained by maximum likelihood estimation. Recently, however, there has been growing interest in discriminative methods for parameter estimation in CD-HMMs. This thesis applies the idea of large margin training to parameter estimation in CD-HMMs.

The principles of large margin training have been intensively studied, most prominently in support vector machines (SVMs). In SVMs, large margin training presents an attractive conceptual framework because it provides theoretical guarantees that balance model complexity against generalization. It also presents an attractive computational framework because it casts many learning problems as tractable convex optimizations.

This thesis extends and develops large margin methods for estimating the parameters of acoustic models for ASR. As in SVMs, the starting point is to postulate that correct and incorrect classifications are separated by a large margin; model parameters are then optimized to maximize this margin. This thesis presents algorithms for training Gaussian mixture models both as multiway classifiers in their own right and as individual components of larger models (e.g., observation models in CD-HMMs). The new techniques differ from previous discriminative methods for ASR in that their explicit goal is margin maximization. Additionally, the new techniques lead to efficient algorithms based on convex optimizations.

This thesis evaluates the utility of large margin training on two benchmark problems in acoustic modeling: phonetic classification and phonetic recognition on the TIMIT speech database. In both tasks, large margin systems obtain significantly better performance than systems trained by maximum likelihood estimation or by competing discriminative frameworks, such as conditional maximum likelihood and minimum classification error. This thesis also examines the utility of subgradient and extragradient methods, both of which were recently proposed for large margin training in domains other than ASR. Comparative experimental results suggest that the learning methods developed in this thesis both scale better and yield better performance. The thesis concludes with brief discussions of future research directions, including the application of large margin training techniques to large vocabulary ASR.
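
The margin idea described above can be illustrated with a small sketch. It is not the thesis's actual formulation, which the abstract does not spell out. It assumes each class is modeled by a single diagonal-covariance Gaussian and penalizes any incorrect class whose log-score comes within a unit margin of the correct class's log-score. The names `gaussian_log_score`, `margin_hinge_loss`, and the `margin` parameter are hypothetical and chosen only for illustration.

```python
import math


def gaussian_log_score(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def margin_hinge_loss(x, label, models, margin=1.0):
    """Sum of hinge penalties over incorrect classes (illustrative only).

    A class c != label contributes max(0, margin + score_c - score_label),
    so the loss is zero only when the correct class wins by at least `margin`.
    """
    scores = [gaussian_log_score(x, mean, var) for mean, var in models]
    return sum(
        max(0.0, margin + s - scores[label])
        for c, s in enumerate(scores)
        if c != label
    )


# Two toy classes, each given as a (mean, variance) pair in two dimensions.
models = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]

# A point at the class-0 mean is separated from class 1 by a wide margin.
loss_far = margin_hinge_loss([0.0, 0.0], 0, models)

# A point near the decision boundary is still classified correctly,
# but it falls inside the margin and therefore incurs a positive loss.
loss_near = margin_hinge_loss([1.4, 1.4], 0, models)
```

In this toy setting the second point is classified correctly yet still penalized, which is the sense in which a margin criterion asks for more than a correct decision. Minimizing such a loss over the model parameters is one generic way to maximize the margin; whether the resulting problem is convex depends on the parameterization, which this sketch does not address.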
