Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling

Acoustic models used in hidden Markov model/neural-network (HMM/NN) speech recognition systems are usually trained with a frame-based cross-entropy error criterion. In contrast, Gaussian mixture HMM systems are discriminatively trained using sequence-based criteria, such as minimum phone error or maximum mutual information, that are more directly related to speech recognition accuracy. This paper demonstrates that neural-network acoustic models can be trained with sequence classification criteria using exactly the same lattice-based methods that have been developed for Gaussian mixture HMMs, and that using a sequence classification criterion in training leads to considerably better performance. A neural-network acoustic model with 153K weights trained on 50 hours of broadcast news has a word error rate of 34.0% on the rt04 English broadcast news test set. When this model is trained with the state-level minimum Bayes risk criterion, the rt04 word error rate is 27.7%.
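
To put the two reported error rates side by side, the short sketch below computes the absolute and relative word error rate reduction from the figures quoted above (34.0% and 27.7% on rt04). The function name `relative_wer_reduction` and the variable names are hypothetical, chosen for illustration only; the calculation assumes the 34.0% figure is the baseline before sequence training and is not taken from the paper itself.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER.

    Hypothetical helper for illustration; not part of the paper.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Word error rates (in percent) quoted in the abstract for the rt04 test set.
baseline = 34.0   # assumed baseline: the model before sequence training
smbr = 27.7       # after state-level minimum Bayes risk training

absolute = baseline - smbr                          # percentage points
relative = relative_wer_reduction(baseline, smbr)   # fraction of baseline

print(f"absolute: {absolute:.1f} points, relative: {relative:.1%}")
# prints "absolute: 6.3 points, relative: 18.5%"
```

Reporting both numbers is a common convention in speech recognition, since an absolute gain of a given size means more at a lower baseline error rate.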
