Optimizing Expected Word Error Rate via Sampling for Speech Recognition

State-level minimum Bayes risk (sMBR) training has become the de facto standard for sequence-level training of speech recognition acoustic models. It has an elegant formulation using the expectation semiring, and gives large improvements in word error rate (WER) over models trained solely using cross-entropy (CE) or connectionist temporal classification (CTC). sMBR training optimizes the expected number of frames at which the reference and hypothesized acoustic states differ. It may be preferable to optimize the expected WER instead, but WER does not interact well with the expectation semiring, and previous approaches to computing the expected WER exactly require expanding the lattices used during training. In this paper we show how to optimize the expected WER by sampling paths from the lattices used during conventional sMBR training. The gradient of the expected WER is itself an expectation, and so may be approximated using Monte Carlo sampling. We show experimentally that optimizing expected WER during acoustic model training gives a 5% relative improvement in WER over a well-tuned sMBR baseline on a 2-channel query recognition task (Google Home).
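As a minimal sketch of the sampling idea the abstract alludes to (the notation here is ours, not the paper's): writing $w^*$ for the reference word sequence and $p_\theta(w \mid x)$ for the lattice-constrained posterior over hypotheses $w$ given acoustics $x$, the standard score-function (REINFORCE) identity turns the gradient of the expected WER into an expectation, which can then be estimated from sampled paths $w_1, \dots, w_N$:

\[
\nabla_\theta \, \mathbb{E}_{w \sim p_\theta(\cdot \mid x)}\!\big[\mathrm{WER}(w, w^*)\big]
= \mathbb{E}_{w \sim p_\theta(\cdot \mid x)}\!\big[\mathrm{WER}(w, w^*)\, \nabla_\theta \log p_\theta(w \mid x)\big]
\approx \frac{1}{N} \sum_{i=1}^{N} \big(\mathrm{WER}(w_i, w^*) - b\big)\, \nabla_\theta \log p_\theta(w_i \mid x).
\]

Here $b$ is an optional baseline (e.g. the mean WER of the sampled paths), a common variance-reduction device for such estimators: since $\mathbb{E}[\nabla_\theta \log p_\theta(w \mid x)] = 0$, subtracting $b$ leaves the estimator unbiased.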
