Analysis of Extended Baum–Welch and Constrained Optimization for Discriminative Training of HMMs

Discriminative training is an essential part of building a state-of-the-art speech recognition system. The Extended Baum–Welch (EBW) algorithm is the most popular method for carrying out this demanding large-scale optimization task. This paper presents a novel analysis of the EBW algorithm, which shows that EBW performs a specific kind of constrained optimization. The constraints reveal an interesting connection between the improvement of the discriminative criterion and the Kullback–Leibler divergence (KLD). Based on the analysis, a novel method for controlling the EBW algorithm is proposed. The analysis uses decomposed formulae for Gaussian mixture KLDs, which correspond to the ones used in the Constrained Line Search (CLS) optimization algorithm. The CLS algorithm for discriminative training is therefore also briefly presented and its connections to EBW are studied. Large vocabulary speech recognition experiments are used to evaluate the proposed control method for EBW, which is shown to outperform the common heuristics in model robustness. A comparison of EBW with CLS also shows differences in robustness in favor of EBW. The constraints for Gaussian parameter optimization and the special mixture weight estimation method used with EBW are shown to be the key factors for good performance.
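To make the link between EBW and the KLD concrete, the following is a minimal sketch, not the method proposed in this paper. It assumes the commonly used EBW update for a single diagonal-covariance Gaussian with a smoothing constant D, and it measures the KLD from the old Gaussian to the updated one (the direction of the divergence is an assumption here). The function names `ebw_update` and `kl_diag_gauss` and the toy statistics are hypothetical and serve only as illustration.

```python
import math


def ebw_update(mean, var, num, den, d):
    """Textbook-style EBW update for one diagonal Gaussian.

    num and den are (gamma, x, s) tuples of accumulated statistics:
    gamma = total occupancy, x = per-dimension sum of gamma_t * x_t,
    s = per-dimension sum of gamma_t * x_t**2.
    d is the smoothing constant D (assumed large enough to keep variances positive).
    """
    g_num, x_num, s_num = num
    g_den, x_den, s_den = den
    denom = g_num - g_den + d
    new_mean, new_var = [], []
    for i in range(len(mean)):
        m = (x_num[i] - x_den[i] + d * mean[i]) / denom
        v = (s_num[i] - s_den[i] + d * (var[i] + mean[i] ** 2)) / denom - m * m
        new_mean.append(m)
        new_var.append(v)
    return new_mean, new_var


def kl_diag_gauss(m0, v0, m1, v1):
    """KL divergence KL(N(m0, v0) || N(m1, v1)) for diagonal Gaussians."""
    total = 0.0
    for a, va, b, vb in zip(m0, v0, m1, v1):
        total += math.log(vb / va) + (va + (a - b) ** 2) / vb - 1.0
    return 0.5 * total


if __name__ == "__main__":
    old_mean, old_var = [0.0], [1.0]
    num = (10.0, [12.0], [22.4])  # toy numerator statistics (made up)
    den = (8.0, [4.0], [11.6])    # toy denominator statistics (made up)
    for d in (16.0, 160.0):
        m, v = ebw_update(old_mean, old_var, num, den, d)
        print(d, m, v, kl_diag_gauss(old_mean, old_var, m, v))
```

In this toy setting a larger D keeps the updated Gaussian closer to the old one, so the KLD between them shrinks. This matches the intuition that the smoothing constant acts like a limit on how far the model may move in one iteration.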
