Multitask Training with Text Data for End-to-End Speech Recognition

We propose a multitask training method for attention-based end-to-end speech recognition models that better incorporates language-level information. We regularize the decoder of a sequence-to-sequence architecture by training it jointly on the speech recognition task and a next-token-prediction language modeling task. Trained on either the 100-hour subset of LibriSpeech or the full 960-hour dataset, the proposed method yields an 11% relative improvement over the baseline and is comparable to language-model shallow fusion, without requiring an additional neural network at decoding time. Analyses of sample output sentences and of the word error rate on rare words show that the proposed method incorporates language-level information effectively.
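The abstract does not spell out the training objective, but the core idea, a shared decoder optimized on a weighted sum of an ASR cross-entropy loss over paired speech-text data and a next-token-prediction loss over text-only data, can be sketched as below. This is a minimal PyTorch-style sketch under assumptions of our own: the `model.encode`/`model.decode` interface, the `enc=None` convention for text-only batches, and the `lm_weight` value are hypothetical illustrations, not details taken from the paper.

```python
import torch.nn.functional as F

# Hypothetical interface: model.encode(speech) returns acoustic encoder
# states; model.decode(tokens, enc) returns next-token logits of shape
# (batch, time, vocab) given previous tokens and optional encoder states.

def multitask_step(model, speech, transcript, text_only, lm_weight=0.3):
    # --- ASR branch: standard seq2seq cross-entropy on paired data ---
    enc = model.encode(speech)
    asr_logits = model.decode(transcript[:, :-1], enc)   # teacher forcing
    asr_loss = F.cross_entropy(
        asr_logits.transpose(1, 2),   # (batch, vocab, time) for F.cross_entropy
        transcript[:, 1:])            # targets shifted by one token

    # --- LM branch: the same decoder on unpaired text, no acoustic input ---
    # We assume the decoder skips (or zeroes out) cross-attention when
    # enc is None, so it acts as a plain next-token language model.
    lm_logits = model.decode(text_only[:, :-1], enc=None)
    lm_loss = F.cross_entropy(
        lm_logits.transpose(1, 2),
        text_only[:, 1:])

    # Weighted sum of the two objectives regularizes the shared decoder.
    return asr_loss + lm_weight * lm_loss
```

Because the language modeling signal is absorbed into the decoder's own weights during training, decoding proceeds exactly as in the baseline, which is what makes the method cheaper than shallow fusion with an external language model.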
