论文信息 - Integrating Source-Channel and Attention-Based Sequence-to-Sequence Models for Speech Recognition

Integrating Source-Channel and Attention-Based Sequence-to-Sequence Models for Speech Recognition

This paper proposes a novel automatic speech recognition (ASR) framework called Integrated Source-Channel and Attention (ISCA) that combines the advantages of traditional systems based on the noisy source-channel model (SC) and end-to-end style systems using attention-based sequence-to-sequence models. The traditional SC system framework includes hidden Markov models and connectionist temporal classification (CTC) based acoustic models, language models (LMs), and a decoding procedure based on a lexicon, whereas the end-to-end style attention-based system jointly models the whole process with a single model. By rescoring the hypotheses produced by traditional systems using end-to-end style systems based on an extended noisy source-channel model, ISCA allows structured knowledge to be easily incorporated via the SC-based model while exploiting the complementarity of the attention-based model. Experiments on the AMI meeting corpus show that ISCA is able to give a relative word error rate reduction up to 21% over an individual system, and by 13% over an alternative method which also involves combining CTC and attention-based models.

[1] Shinji Watanabe,et al. Joint CTC-attention based end-to-end speech recognition using multi-task learning , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Philip C. Woodland,et al. PyHTK: Python Library and ASR Pipelines for HTK , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Peter Bell,et al. Windowed Attention Mechanisms for Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] Satoshi Nakamura,et al. Sequence-to-Sequence Asr Optimization Via Reinforcement Learning , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[6] John S. Bridle,et al. Alpha-nets: A recurrent 'neural' network architecture with a hidden Markov model interpretation , 1990, Speech Commun..

[7] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[8] Jan Niehues,et al. Very Deep Self-Attention Networks for End-to-End Speech Recognition , 2019, INTERSPEECH.

[9] Frederick Jelinek,et al. Statistical methods for speech recognition , 1997 .

[10] Shinji Watanabe,et al. Auxiliary Feature Based Adaptation of End-to-end ASR Systems , 2018, INTERSPEECH.

[11] Geoffrey Zweig,et al. Toward Human Parity in Conversational Speech Recognition , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[12] Gunnar Evermann,et al. Posterior probability decoding, confidence estimation and system combination , 2000 .

[13] Sanjeev Khudanpur,et al. A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[14] Yu Zhang,et al. Very deep convolutional networks for end-to-end speech recognition , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Shinji Watanabe,et al. Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Jonathan G. Fiscus,et al. A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[17] Andrew W. Senior,et al. Fast and accurate recurrent neural network acoustic models for speech recognition , 2015, INTERSPEECH.

[18] Tara N. Sainath,et al. A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[19] Mohammad Norouzi,et al. Optimal Completion Distillation for Sequence Learning , 2018, ICLR.

[20] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Yoshua Bengio,et al. Attention-Based Models for Speech Recognition , 2015, NIPS.

[22] Daniel Povey,et al. Large scale discriminative training of hidden Markov models for speech recognition , 2002, Comput. Speech Lang..

[23] Shigeru Katagiri,et al. Speaker Adaptation for Multichannel End-to-End Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24] Sanjeev Khudanpur,et al. End-to-end Speech Recognition Using Lattice-free MMI , 2018, INTERSPEECH.

[25] Dong Yu,et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[26] Shinji Watanabe,et al. ESPnet: End-to-End Speech Processing Toolkit , 2018, INTERSPEECH.

[27] Steve Renals,et al. A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition , 2015, INTERSPEECH.

[28] Richard Socher,et al. Improving End-to-End Speech Recognition with Policy Learning , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29] Quoc V. Le,et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[30] Daniel Povey,et al. Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[31] Dong Yu,et al. Component Fusion: Learning Replaceable Language Model Component for End-to-end Speech Recognition System , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32] Andrew W. Senior,et al. Off-line Cursive Handwriting Recognition using Recurrent Neural Networks , 1994 .

[33] Mark J. F. Gales,et al. Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[34] Yu Zhang,et al. Latent Sequence Decompositions , 2016, ICLR.

[35] Colin Raffel,et al. Monotonic Chunkwise Attention , 2017, ICLR.

[36] Yifan Gong,et al. Speaker Adaptation for End-to-End CTC Models , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[37] Dong Yu,et al. Error back propagation for sequence training of Context-Dependent Deep NetworkS for conversational speech transcription , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[38] C. Zhang,et al. Improved Tdnns Using Deep Kernels and Frequency Dependent Grid-RNNS , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39] Yongqiang Wang,et al. Two Efficient Lattice Rescoring Methods Using Recurrent Neural Network Language Models , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[40] Tomoki Toda,et al. Back-Translation-Style Data Augmentation for end-to-end ASR , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[41] Jun Wang,et al. Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[42] Hermann Ney,et al. RWTH ASR Systems for LibriSpeech: Hybrid vs Attention - w/o Data Augmentation , 2019, INTERSPEECH.

[43] John R. Hershey,et al. Hybrid CTC/Attention Architecture for End-to-End Speech Recognition , 2017, IEEE Journal of Selected Topics in Signal Processing.

[44] Tomohiro Nakatani,et al. SEQUENCE TRAINING OF ENCODER-DECODER MODEL USING POLICY GRADIENT FOR END- TO-END SPEECH RECOGNITION , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[45] Tara N. Sainath,et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[46] Tara N. Sainath,et al. Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[47] Yoshua Bengio,et al. End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[48] Johan Schalkwyk,et al. Learning acoustic frame labeling for speech recognition with recurrent neural networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[49] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[50] Alex Graves,et al. Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[51] Yonghong Yan,et al. An Exploration of Dropout with LSTMs , 2017, INTERSPEECH.

[52] Shuang Xu,et al. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[53] Shuang Xu,et al. Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese , 2018, INTERSPEECH.

[54] Steve J. Young,et al. Tree-Based State Tying for High Accuracy Modelling , 1994, HLT.

[55] George Saon,et al. Speaker adaptation of neural network acoustic models using i-vectors , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[56] Liang Lu,et al. On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[57] Shinji Watanabe,et al. End-to-end Speech Recognition With Word-Based Rnn Language Models , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[58] Yiming Wang,et al. Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI , 2016, INTERSPEECH.

[59] Tara N. Sainath,et al. No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[60] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[61] Yongqiang Wang,et al. Simplifying long short-term memory acoustic models for fast training and decoding , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[62] Quoc V. Le,et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[63] Tara N. Sainath,et al. Lower Frame Rate Neural Network Acoustic Models , 2016, INTERSPEECH.

[64] Hermann Ney,et al. CTC in the Context of Generalized Full-Sum HMM Training , 2017, INTERSPEECH.

[65] Ying Zhang,et al. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks , 2016, INTERSPEECH.

[66] Jean Carletta,et al. The AMI Meeting Corpus: A Pre-announcement , 2005, MLMI.