Transformer Based Deliberation for Two-Pass Speech Recognition

Interactive speech recognition systems must generate words quickly while also producing accurate results. Two-pass models satisfy both requirements by employing a first-pass decoder that quickly emits words and a second-pass decoder that requires more context but is more accurate. Previous work has established that a deliberation network can be an effective second-pass model. The model attends to two kinds of inputs at once: encoded audio frames and the hypothesis text from the first-pass model. In this work, we explore using transformer layers instead of long short-term memory (LSTM) layers for deliberation rescoring. In the transformer layers, we generalize the “encoder-decoder” attention to attend to both the encoded audio and the first-pass text hypotheses. The output context vectors are then combined by a merger layer. Compared to LSTM-based deliberation, our best transformer deliberation achieves a 7% relative word error rate (WER) improvement along with a 38% reduction in computation. We also compare against non-deliberation transformer rescoring and find a 9% relative WER improvement.
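
To make the generalized “encoder-decoder” attention concrete, below is a minimal sketch of one deliberation-decoder layer: two parallel cross-attentions, one over encoded audio frames and one over encoded first-pass hypothesis tokens, whose context vectors are combined by a merger layer. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the class name, dimensions, and the concatenation-plus-projection merger are assumptions, and PyTorch is used only for brevity.

```python
import torch
import torch.nn as nn

class DualSourceDecoderLayer(nn.Module):
    """One transformer deliberation-decoder layer (illustrative sketch).

    The usual single "encoder-decoder" attention is generalized into two
    parallel cross-attentions: one over encoded audio frames and one over
    encoded first-pass hypothesis tokens. The resulting context vectors
    are combined by a merger layer (here an assumed choice: concatenation
    followed by a linear projection).
    """

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.hyp_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Assumed merger: concatenate the two contexts, project back to d_model.
        self.merger = nn.Linear(2 * d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y, audio_enc, hyp_enc):
        # Causal self-attention over the second-pass target tokens.
        t = y.size(1)
        causal = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=y.device), diagonal=1
        )
        s, _ = self.self_attn(y, y, y, attn_mask=causal)
        y = self.norm1(y + s)
        # Two cross-attentions: encoded audio and first-pass text hypotheses.
        c_audio, _ = self.audio_attn(y, audio_enc, audio_enc)
        c_hyp, _ = self.hyp_attn(y, hyp_enc, hyp_enc)
        # Merger layer combines the two output context vectors.
        merged = self.merger(torch.cat([c_audio, c_hyp], dim=-1))
        y = self.norm2(y + merged)
        return self.norm3(y + self.ffn(y))

# Example shapes (batch=2): 200 audio frames, 12 hypothesis tokens,
# 10 second-pass target positions, model width 512.
layer = DualSourceDecoderLayer()
out = layer(
    y=torch.randn(2, 10, 512),
    audio_enc=torch.randn(2, 200, 512),
    hyp_enc=torch.randn(2, 12, 512),
)  # -> (2, 10, 512)
```

The concatenation-based merger shown here is only one plausible design; summing the two context vectors, or applying a further attention over the two sources, would be equally reasonable ways to realize the merger layer.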
