Low Latency Speech Recognition Using End-to-End Prefetching

Latency is a crucial metric for streaming speech recognition systems. In this paper, we reduce latency by fetching responses early based on partial recognition results, a technique we refer to as prefetching. Specifically, prefetching submits partial recognition results for subsequent processing, such as obtaining assistant server responses or second-pass rescoring, before the recognition result is finalized. If a partial result matches the final recognition result, the prefetched response can be delivered to the user instantly. This effectively speeds up the system by saving the execution latency that would otherwise be incurred after recognition completes. Prefetching can be triggered multiple times for a single query, but each trigger incurs another round of downstream processing and increases computation cost. It is therefore desirable to fetch results early while limiting the number of prefetches. To achieve the best trade-off between latency and computation cost, we investigate a series of prefetching decision models, including decoder-silence-based prefetching, acoustic-silence-based prefetching, and end-to-end prefetching. We demonstrate that the proposed prefetching mechanism reduces latency by ∼200 ms for a system consisting of a streaming first-pass model using a recurrent neural network transducer and a non-streaming second-pass rescoring model using Listen, Attend and Spell. We observe that end-to-end prefetching provides the best trade-off between cost and latency, and is 120 ms faster than silence-based prefetching at a fixed prefetch rate.
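The prefetching mechanism described above can be sketched as a small cache keyed by partial hypotheses: responses are fetched early when a decision model fires, and served instantly if a prefetched hypothesis matches the final result. The class and parameter names below are hypothetical, and the decision signal stands in for the paper's learned prefetching models; this is a minimal illustration, not the authors' implementation.

```python
class Prefetcher:
    """Minimal sketch of response prefetching for streaming ASR.

    fetch_fn stands in for an expensive downstream call (e.g. an
    assistant server request or second-pass rescoring).
    """

    def __init__(self, fetch_fn, max_prefetches=3):
        self.fetch_fn = fetch_fn
        # Cap on prefetches per query, to bound extra computation cost.
        self.max_prefetches = max_prefetches
        # Maps a partial hypothesis to its prefetched response.
        self.cache = {}

    def on_partial(self, hypothesis, should_prefetch):
        """Called on each partial result; should_prefetch is the output
        of a prefetching decision model (silence-based or end-to-end)."""
        if (should_prefetch
                and hypothesis not in self.cache
                and len(self.cache) < self.max_prefetches):
            self.cache[hypothesis] = self.fetch_fn(hypothesis)

    def on_final(self, hypothesis):
        """Called once the recognition result is finalized. If a prefetch
        matched, the response is returned instantly; otherwise the full
        downstream latency is paid now."""
        if hypothesis in self.cache:
            return self.cache[hypothesis]
        return self.fetch_fn(hypothesis)
```

The latency saving comes from overlapping `fetch_fn` with the tail of recognition: when the final hypothesis was already prefetched, `on_final` is a dictionary lookup rather than a downstream round trip.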
