Optimization methods for discriminative training

Discriminative training applied to hidden Markov model (HMM) design can yield significant gains in recognition accuracy and model compactness. However, compared to Maximum Likelihood-based methods, discriminative training typically requires far more computation, since all competing candidates must be considered rather than just the correct one. The choice of algorithm used to optimize the discriminative criterion function is therefore a key issue. We investigated several such algorithms and applied them to discriminative training within the Minimum Classification Error (MCE) framework. In particular, we examined on-line, batch, and semi-batch Probabilistic Descent (PD), as well as Quickprop, Rprop, and BFGS. We describe each algorithm and present comparative results on the TIMIT phone classification task and on the 230-hour Corpus of Spontaneous Japanese (CSJ) 30K-word continuous speech recognition task.
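As background for one of the batch optimizers compared above, the following is a minimal sketch of the Rprop update rule in Python, in the simplified variant without weight backtracking. The function name rprop_step, its signature, and the way state is threaded through are illustrative assumptions, not the paper's implementation; the hyperparameter defaults are the values commonly recommended for Rprop.

    import numpy as np

    def rprop_step(params, grad, prev_grad, step_sizes,
                   eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=50.0):
        # Illustrative sketch: one Rprop update without backtracking.
        # Each parameter keeps its own step size, adapted from the sign
        # agreement of successive gradients; gradient magnitude is ignored.
        sign_change = grad * prev_grad
        # Same sign as last step: the move was safe, so grow the step (capped).
        step_sizes = np.where(sign_change > 0,
                              np.minimum(step_sizes * eta_plus, step_max),
                              step_sizes)
        # Sign flipped: the previous step overshot a minimum, so shrink it.
        step_sizes = np.where(sign_change < 0,
                              np.maximum(step_sizes * eta_minus, step_min),
                              step_sizes)
        # Move each parameter against the sign of its current gradient.
        return params - np.sign(grad) * step_sizes, step_sizes

In an MCE setting, grad would be the gradient of the smoothed classification-error loss accumulated over the full training set (Rprop is a batch method); its sign-only step makes it insensitive to the large dynamic range of gradients across different HMM parameters.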
