Streaming End-to-end Speech Recognition for Mobile Devices

End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and, above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer (RNN-T). In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy across a number of evaluation categories.

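The RNN-T mentioned above combines a streaming acoustic encoder with a label-history "prediction network"; a joint network merges the two and emits a distribution over output labels plus a blank symbol at each step, which is what permits frame-synchronous, streaming decoding. As a rough illustration only, the sketch below shows this joint computation in NumPy; all layer sizes, weight names, and the single-frame setup are hypothetical and are not taken from the paper.

```python
# Minimal, illustrative sketch of an RNN-T joint computation (NumPy).
# Dimensions and weights are hypothetical; this is not the authors' model.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: encoder dim, prediction-network dim, joint dim, vocab (incl. blank).
ENC_DIM, PRED_DIM, JOINT_DIM, VOCAB = 8, 8, 16, 5

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(ENC_DIM, JOINT_DIM))
W_pred = rng.normal(size=(PRED_DIM, JOINT_DIM))
W_out = rng.normal(size=(JOINT_DIM, VOCAB))

def joint(enc_t, pred_u):
    """Combine one encoder frame and one prediction-network state into a
    distribution over output labels plus the blank symbol."""
    hidden = np.tanh(enc_t @ W_enc + pred_u @ W_pred)
    return softmax(hidden @ W_out)

# One acoustic frame (from the streaming encoder) and one label-history state.
enc_t = rng.normal(size=ENC_DIM)
pred_u = rng.normal(size=PRED_DIM)
print(joint(enc_t, pred_u))  # probabilities over {labels} + {blank}
```

In a streaming recognizer, this joint step would run as each new encoder frame arrives, emitting labels (advancing the prediction network) or blank (advancing to the next frame) without waiting for the end of the utterance.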