Advancing Sequence-to-Sequence Based Speech Recognition

This paper presents our efforts to improve state-of-the-art speech recognition results using attention-based neural network approaches. Our experiments focus on LibriSpeech, a well-known, publicly available, large speech corpus, but the methodologies are readily applicable to other tasks. Through systematic application of standard techniques (sophisticated data augmentation, various dropout schemes, scheduled sampling, warm restarts) and optimized search configurations, our model achieves 4.0% and 11.7% word error rate (WER) on the test-clean and test-other sets, without any external language model. A powerful recurrent language model reduces the error rates further, to 2.7% and 8.2%. Thus, we not only report the lowest sequence-to-sequence numbers on this task to date, but our single system even challenges the best result known in the literature, namely a hybrid model combined with recurrent language model rescoring. A simple ROVER combination of several of our attention-based systems achieves 2.5% and 7.3% WER on the clean and other test sets.
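One of the training techniques named above, warm restarts, periodically resets the learning rate and anneals it with a cosine schedule (SGDR-style). The sketch below is a minimal illustration under our own assumptions about cycle length and growth factor, not the exact configuration used in the experiments:

```python
import math

def sgdr_lr(step, cycle_len, eta_max=1.0, eta_min=0.0, mult=2):
    """Cosine-annealed learning rate with warm restarts (SGDR-style).

    The rate decays from eta_max to eta_min over a cycle, then "restarts"
    at eta_max; each new cycle is `mult` times longer than the previous.
    All parameter values here are illustrative assumptions.
    """
    t, length = step, cycle_len
    # Locate the position of `step` within the current restart cycle.
    while t >= length:
        t -= length
        length *= mult
    # Cosine annealing within the current cycle.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / length))
```

At the start of each cycle the rate jumps back to `eta_max` (e.g. `sgdr_lr(0, 10)` and `sgdr_lr(10, 10)` both return 1.0), which in practice encourages the optimizer to escape sharp minima before annealing again.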
