论文信息 - Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models

Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models

Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER. In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we find that the proposed training procedure improves performance by up to 8.2% relative to the baseline system. This allows us to train grapheme-based, uni-directional attention-based models which match the performance of a traditional, state-of-the-art, discriminative sequence-trained system on a mobile voice-search task.

[1] Geoffrey Zweig,et al. Advances in all-neural speech recognition , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Tara N. Sainath,et al. A Comparison of Sequence-to-Sequence Models for Speech Recognition , 2017, INTERSPEECH.

[3] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[4] Alex Graves,et al. Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[5] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[6] Dong Yu,et al. Error back propagation for sequence training of Context-Dependent Deep NetworkS for conversational speech transcription , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7] Hagen Soltau,et al. Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition , 2016, INTERSPEECH.

[8] Stephen J. Wright,et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.

[9] Yoshua Bengio,et al. End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Andrew W. Senior,et al. Fast and accurate recurrent neural network acoustic models for speech recognition , 2015, INTERSPEECH.

[11] Mike Schuster,et al. Japanese and Korean voice search , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Rohit Prabhavalkar,et al. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[13] Joelle Pineau,et al. An Actor-Critic Algorithm for Structured Prediction , 2016 .

[14] Tara N. Sainath,et al. Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home , 2017, INTERSPEECH.

[15] Quoc V. Le,et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Tara N. Sainath,et al. Lower Frame Rate Neural Network Acoustic Models , 2016, INTERSPEECH.

[17] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[18] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[19] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[20] Samy Bengio,et al. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[21] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[22] Matt Shannon,et al. Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping , 2017, INTERSPEECH.

[23] Matt Shannon,et al. Optimizing Expected Word Error Rate via Sampling for Speech Recognition , 2017, INTERSPEECH.

[24] Marc'Aurelio Ranzato,et al. Sequence Level Training with Recurrent Neural Networks , 2015, ICLR.

[25] Navdeep Jaitly,et al. Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[26] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[27] Tara N. Sainath,et al. An Analysis of "Attention" in Sequence-to-Sequence Models , 2017, INTERSPEECH.

[28] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[29] Tara N. Sainath,et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30] Brian Kingsbury,et al. Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[31] Tara N. Sainath,et al. Acoustic modelling with CD-CTC-SMBR LSTM RNNS , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[32] Kuldip K. Paliwal,et al. Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[33] Ronald J. Williams,et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.