Attention-based sequence-to-sequence model for speech recognition: development of a state-of-the-art system on LibriSpeech and its application to non-native English

Recent research has shown that attention-based sequence-to-sequence models such as Listen, Attend and Spell (LAS) yield results comparable to state-of-the-art ASR systems on various tasks. In this paper, we describe the development of such a system and demonstrate its performance on two tasks: first, we achieve a new state-of-the-art word error rate of 3.43% on the test-clean subset of the LibriSpeech English data; second, on non-native English speech, including both read and spontaneous speech, we obtain very competitive results compared to a conventional system built with the latest Kaldi recipe.
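For concreteness, below is a minimal sketch of an LAS-style model, assuming PyTorch. All names (Listener, Speller), layer sizes, and the use of a single pyramid layer are illustrative assumptions for this sketch, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Listener(nn.Module):
    """Pyramidal BiLSTM encoder ("Listen"): each pyramid layer halves the time resolution."""

    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.base = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # The pyramid layer consumes pairs of concatenated frames (4 * hidden features).
        self.pyramid = nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                                   # x: (B, T, feat_dim)
        h, _ = self.base(x)                                 # (B, T, 2 * hidden)
        B, T, D = h.shape
        h = h[:, : T - T % 2, :].reshape(B, T // 2, 2 * D)  # stack adjacent frame pairs
        h, _ = self.pyramid(h)                              # (B, T // 2, 2 * hidden)
        return h


class Speller(nn.Module):
    """Attention-based LSTM decoder ("Attend and Spell") over output tokens."""

    def __init__(self, vocab_size, enc_dim=512, hidden=512, emb_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTMCell(emb_dim + enc_dim, hidden)
        self.query = nn.Linear(hidden, enc_dim)             # projects decoder state to a query
        self.out = nn.Linear(hidden + enc_dim, vocab_size)

    def forward(self, enc, targets):                        # enc: (B, U, enc_dim)
        B = enc.size(0)
        s = enc.new_zeros(B, self.rnn.hidden_size)
        c = enc.new_zeros(B, self.rnn.hidden_size)
        ctx = enc.new_zeros(B, enc.size(2))
        logits = []
        for t in range(targets.size(1)):                    # teacher forcing on ground truth
            e = self.embed(targets[:, t])
            s, c = self.rnn(torch.cat([e, ctx], dim=-1), (s, c))
            # Content-based attention: dot product between the query and encoder states.
            score = torch.bmm(enc, self.query(s).unsqueeze(2)).squeeze(2)   # (B, U)
            ctx = torch.bmm(F.softmax(score, dim=-1).unsqueeze(1), enc).squeeze(1)
            logits.append(self.out(torch.cat([s, ctx], dim=-1)))
        return torch.stack(logits, dim=1)                   # (B, L, vocab_size)
```

A full system would stack several pyramid layers, decode with beam search, and typically add refinements such as label smoothing and an external language model; the sketch above only illustrates the Listener/Speller split and the per-step content-based attention.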
