Multitask Training with Text Data for End-to-End Speech Recognition

We propose a multitask training method for attention-based end-to-end speech recognition models that better incorporates language-level information. We regularize the decoder in a sequence-to-sequence architecture by training it jointly on the speech recognition task and a next-token-prediction language modeling task. Trained on either the 100-hour subset of LibriSpeech or the full 960-hour dataset, the proposed method yields an 11% relative performance improvement over the baseline and is comparable to language model shallow fusion, without requiring an additional neural network during decoding. Analyses of sample output sentences and of the word error rate on rare words demonstrate that the proposed method incorporates language-level information effectively.
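The joint objective described above can be sketched as a weighted sum of the two per-token cross-entropy losses seen by the shared decoder. This is a minimal illustrative sketch, not the paper's implementation: the function names and the interpolation weight `lam` are assumptions, and the paper's actual weighting scheme may differ.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under a predicted distribution."""
    return -math.log(probs[target])

def multitask_loss(asr_probs, asr_targets, lm_probs, lm_targets, lam=0.3):
    """Hypothetical combined objective for the shared decoder.

    asr_probs / asr_targets: per-token output distributions and labels from
    the speech recognition task (decoder attends to the acoustic encoder).
    lm_probs / lm_targets: per-token distributions and labels from the
    next-token language modeling task on text-only data.
    lam: assumed interpolation weight between the two losses.
    """
    asr_loss = sum(cross_entropy(p, t)
                   for p, t in zip(asr_probs, asr_targets)) / len(asr_targets)
    lm_loss = sum(cross_entropy(p, t)
                  for p, t in zip(lm_probs, lm_targets)) / len(lm_targets)
    # Weighted combination: the LM term regularizes the decoder toward
    # language-level structure without a separate network at decode time.
    return (1 - lam) * asr_loss + lam * lm_loss
```

In practice both terms would come from the same decoder (with the cross-attention to the encoder disabled or zeroed for the LM pass), so no extra model is needed when decoding.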
