Emitting Word Timings with End-to-End Models

Having end-to-end (E2E) models emit the start and end times of words on-device is important for many applications. This remains an unsolved problem that poses challenges with respect to model size, latency, and accuracy. In this paper, we present an approach to emitting word timings by constraining the attention head of the Listen, Attend and Spell (LAS) 2nd-pass rescorer [1]. On a Voice Search task, we show that this approach does not degrade accuracy compared to leaving the attention head unconstrained, while meeting on-device size and latency constraints. In comparison, constraining the alignment of a 1st-pass Recurrent Neural Network Transducer (RNN-T) model to emit word timings degrades recognition quality. Furthermore, a low-frame-rate conventional acoustic model [2], which is trained with a constrained alignment and is used for word timings in many applications, is slower to detect start and end times than our proposed 2nd-pass LAS approach.
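
To make the idea concrete, the following is a minimal sketch, not the paper's implementation, of how word start and end times might be read off per-token attention distributions from a 2nd-pass attention decoder. The wordpiece convention, the frame duration, and the peak-picking rule are all assumptions chosen for illustration.

```python
import numpy as np

# Illustrative sketch only: derive word start/end times from per-token
# attention distributions of a 2nd-pass attention decoder. The wordpiece
# convention ("▁" marks a word start), the 60 ms frame duration, and the
# peak-picking rule are assumptions, not the paper's exact procedure.

FRAME_DURATION_SEC = 0.06  # assumed duration of one encoder frame

def word_timings(attention_probs, wordpieces):
    """attention_probs: [num_tokens, num_frames] attention weights.
    wordpieces: decoded tokens, where a leading "▁" starts a new word.
    Returns a list of (word, start_sec, end_sec) tuples."""
    # Use the most-attended encoder frame as each token's emission point.
    peak_frames = np.argmax(attention_probs, axis=-1)

    timings, word, start, end = [], "", 0.0, 0.0
    for token, frame in zip(wordpieces, peak_frames):
        if token.startswith("▁"):             # a new word begins here
            if word:
                timings.append((word, start, end))
            word, start = token.lstrip("▁"), frame * FRAME_DURATION_SEC
        else:                                  # continuation of current word
            word += token
        end = (frame + 1) * FRAME_DURATION_SEC
    if word:
        timings.append((word, start, end))
    return timings

# Toy example: "play music" as three wordpieces attending to frames 2, 5, 9.
probs = np.zeros((3, 12))
probs[0, 2] = probs[1, 5] = probs[2, 9] = 1.0
print(word_timings(probs, ["▁play", "▁mu", "sic"]))
# -> roughly [('play', 0.12, 0.18), ('music', 0.30, 0.60)]
```

In this sketch, constraining the attention head would amount to forcing each token's attention mass onto frames near its true acoustic location, so that the peak-picking rule yields usable timings.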

[1] Tara N. Sainath et al., State-of-the-Art Speech Recognition with Sequence-to-Sequence Models, ICASSP 2018.

[2] Andrew W. Senior et al., Fast and accurate recurrent neural network acoustic models for speech recognition, INTERSPEECH 2015.

[3] Tara N. Sainath et al., Towards Fast and Accurate Streaming End-to-End ASR, ICASSP 2020.

[4] Tara N. Sainath et al., A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency, ICASSP 2020.

[5] Tara N. Sainath et al., Two-Pass End-to-End Speech Recognition, INTERSPEECH 2019.

[6] Tara N. Sainath et al., Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home, INTERSPEECH 2017.

[7] Tara N. Sainath et al., Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling, arXiv, 2019.

[8] Tara N. Sainath et al., Recognizing Long-Form Speech Using Streaming End-to-End Models, ASRU 2019.

[9] Tara N. Sainath et al., Lower Frame Rate Neural Network Acoustic Models, INTERSPEECH 2016.

[10] Mike Schuster et al., Japanese and Korean voice search, ICASSP 2012.

[11] Colin Raffel et al., Monotonic Chunkwise Attention, ICLR 2018.

[12] Lukasz Kaiser et al., Attention is All you Need, NIPS 2017.

[13] Martín Abadi et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, arXiv, 2016.

[14] Quoc V. Le et al., Listen, Attend and Spell, arXiv, 2015.

[15] Tara N. Sainath et al., Streaming End-to-end Speech Recognition for Mobile Devices, ICASSP 2019.

[16] Shinji Watanabe et al., Joint CTC-attention based end-to-end speech recognition using multi-task learning, ICASSP 2017.

[17] Navdeep Jaitly et al., Towards Better Decoding and Language Model Integration in Sequence to Sequence Models, INTERSPEECH 2017.

[18] Alex Graves et al., Sequence Transduction with Recurrent Neural Networks, arXiv, 2012.

[19] Tara N. Sainath et al., Shallow-Fusion End-to-End Contextual Biasing, INTERSPEECH 2019.

[20] Lemao Liu et al., On the Word Alignment from Neural Machine Translation, ACL 2019.