Deliberation Model Based Two-Pass End-To-End Speech Recognition

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 12% relative WER reduction compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction on a proper noun test set. Compared to a large conventional model, our best model performs 21% relatively better for VS. In terms of computational complexity, the deliberation decoder has a larger size than the LAS decoder, and hence requires more computations in second-pass decoding.

[1]  Lin-Shan Lee,et al.  Towards End-to-end Speech-to-text Translation with Two-pass Decoding , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Tara N. Sainath,et al.  Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home , 2017, INTERSPEECH.

[3]  Tara N. Sainath,et al.  Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling , 2019, ArXiv.

[4]  Quoc V. Le,et al.  Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Tara N. Sainath,et al.  Lower Frame Rate Neural Network Acoustic Models , 2016, INTERSPEECH.

[6]  Mike Schuster,et al.  Japanese and Korean voice search , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Yoshua Bengio,et al.  End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Tara N. Sainath,et al.  Two-Pass End-to-End Speech Recognition , 2019, INTERSPEECH.

[9]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[10]  Shinji Watanabe,et al.  Joint CTC-attention based end-to-end speech recognition using multi-task learning , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Rohit Prabhavalkar,et al.  Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[12]  Tara N. Sainath,et al.  State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Tara N. Sainath,et al.  Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Brian Roark,et al.  Neural Models of Text Normalization for Speech Applications , 2019, Computational Linguistics.

[15]  Tara N. Sainath,et al.  A Spelling Correction Model for End-to-end Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Arun Narayanan,et al.  Toward Domain-Invariant Speech Recognition via Large Scale Training , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[17]  Tara N. Sainath,et al.  Recognizing Long-Form Speech Using Streaming End-to-End Models , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[18]  Jinyu Li,et al.  Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks. , 2013, ICLR 2013.

[19]  Lijun Wu,et al.  Achieving Human Parity on Automatic Chinese to English News Translation , 2018, ArXiv.

[20]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[21]  Shiliang Zhang,et al.  Investigation of Transformer Based Spelling Correction Model for CTC-Based End-to-End Mandarin Speech Recognition , 2019, INTERSPEECH.

[22]  Tara N. Sainath,et al.  Improving Performance of End-to-End ASR on Numeric Sequences , 2019, INTERSPEECH.

[23]  Heiga Zen,et al.  Parallel WaveNet: Fast High-Fidelity Speech Synthesis , 2017, ICML.

[24]  Nenghai Yu,et al.  Deliberation Networks: Sequence Generation Beyond One-Pass Decoding , 2017, NIPS.

[25]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[26]  Hermann Ney,et al.  RWTH ASR Systems for LibriSpeech: Hybrid vs Attention - w/o Data Augmentation , 2019, INTERSPEECH.

[27]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[28]  Tara N. Sainath,et al.  Streaming End-to-end Speech Recognition for Mobile Devices , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Tara N. Sainath,et al.  An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).