Improving Discriminative Training for Robust Acoustic Models in Large Vocabulary Continuous Speech Recognition

This paper studies the robustness of discriminatively trained acoustic models for large vocabulary continuous speech recognition. Popular discriminative criteria, namely maximum mutual information (MMI), minimum phone error (MPE), and minimum phone frame error (MPFE), are used in the experiments, which include realistic mismatched conditions from the Finnish Speecon corpus and the English Wall Street Journal corpus. A simple regularization method for discriminative training is proposed and shown to improve the robustness of acoustic models, yielding consistent improvements in noisy conditions.
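For context, the MMI criterion named in the abstract is conventionally written as below. This is standard background notation, not an equation reproduced from the paper, and the paper's proposed regularization term is not shown here.

$$
\mathcal{F}_{\text{MMI}}(\lambda) \;=\; \sum_{r=1}^{R} \log
\frac{p_{\lambda}(O_r \mid \mathcal{M}_{w_r})\, P(w_r)}
     {\sum_{w} p_{\lambda}(O_r \mid \mathcal{M}_{w})\, P(w)},
$$

where $O_r$ is the $r$-th training utterance, $w_r$ its reference transcription, $\mathcal{M}_w$ the HMM sequence for word sequence $w$, $P(w)$ the language model probability, and $\lambda$ the acoustic model parameters. MPE and MPFE replace the sentence-level posterior in the numerator with expected phone-level and frame-level accuracies, respectively.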
