Overview of large scale optimization for discriminative training in speech recognition

Over the past few decades, a variety of specialized approaches have been proposed to solve large-scale problems in speech recognition. Conventional optimization techniques have not been widely applied, both because the problems do not readily admit an objective for evaluating a given set of parameters and because the number of parameters is large. This situation is changing, due to recent developments in algorithmic optimization. In this paper, we review the specialized algorithms, including methods derived from the extended Baum-Welch (EBW) approach, Rprop, and generalized iterative scaling (GIS). We discuss optimization frameworks that could also potentially be applied, and outline some connections between these optimization methods and the existing specialized methods.
