Multistate Encoding with End-To-End Speech RNN Transducer Network

Recurrent Neural Network Transducer (RNN-T) models [1] provide highly accurate automatic speech recognition (ASR). Such end-to-end (E2E) models combine the acoustic, pronunciation, and language models (AM, PM, LM) of a conventional ASR system into a single neural network, dramatically reducing complexity and model size. In this paper, we propose a technique for incorporating contextual signals, such as intelligent assistant device state or dialog state, directly into RNN-T models. We explore different encoding methods and demonstrate that RNN-T models can effectively utilize such context. Our technique yields reductions in Word Error Rate (WER) of up to 10.4% relative on a variety of contextual recognition tasks. We also demonstrate that proper regularization can be used to model context independently, improving overall quality.
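
The extract above does not include code, so the following is only a minimal sketch of what "incorporating contextual signals via an encoding" might look like in practice: a hypothetical inventory of device/dialog states is one-hot encoded and the resulting vector is concatenated with each acoustic frame fed to the RNN-T encoder. The state names, the concatenation point, and the context-dropout knob are illustrative assumptions, not the paper's actual method.

    import numpy as np

    # Hypothetical inventory of contextual signals (assistant device or dialog states).
    CONTEXT_STATES = ["none", "media_playing", "timer_ringing", "dialog_followup"]

    def one_hot_context(state):
        """One-hot encode a contextual state (one possible encoding choice)."""
        vec = np.zeros(len(CONTEXT_STATES), dtype=np.float32)
        vec[CONTEXT_STATES.index(state)] = 1.0
        return vec

    def augment_features(acoustic_frames, state, context_dropout=0.0):
        """Tile the context vector across time and concatenate it with each
        acoustic frame, so the RNN-T encoder sees [features ; context]."""
        t = acoustic_frames.shape[0]
        ctx = one_hot_context(state)
        if context_dropout > 0.0 and np.random.rand() < context_dropout:
            # Randomly blank the context during training so the model also
            # learns context-independent recognition (an assumed regularizer).
            ctx = np.zeros_like(ctx)
        ctx = np.tile(ctx, (t, 1))
        return np.concatenate([acoustic_frames, ctx], axis=-1)

    # Example: 100 frames of 80-dim log-mel features, recognized while a timer rings.
    frames = np.random.randn(100, 80).astype(np.float32)
    encoder_inputs = augment_features(frames, "timer_ringing")  # shape (100, 84)

The context_dropout argument sketches one plausible reading of the abstract's note that proper regularization lets the model handle context independently: occasionally zeroing the context vector during training preserves context-free recognition ability. Whether this matches the regularization actually used in the paper is an assumption.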

[1] Shinji Watanabe, et al. Joint CTC-attention based end-to-end speech recognition using multi-task learning, 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Mike Schuster, et al. Japanese and Korean voice search, 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] David A. Patterson, et al. In-datacenter performance analysis of a tensor processing unit, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[4] Alex Graves, et al. Sequence Transduction with Recurrent Neural Networks, 2012, ArXiv.

[5] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[6] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7] Tara N. Sainath, et al. Streaming End-to-end Speech Recognition for Mobile Devices, 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Rohit Prabhavalkar, et al. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer, 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[9] Tara N. Sainath, et al. Recognizing Long-Form Speech Using Streaming End-to-End Models, 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[10] Tara N. Sainath, et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Tara N. Sainath, et al. Multi-Dialect Speech Recognition with a Single Sequence-to-Sequence Model, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Tara N. Sainath, et al. A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Kyunghyun Cho, et al. Dynamic Meta-Embeddings for Improved Sentence Representations, 2018, EMNLP.

[14] Tara N. Sainath, et al. An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Adam Coates, et al. Cold Fusion: Training Seq2Seq Models Together with Language Models, 2017, INTERSPEECH.

[16] Petar S. Aleksic, et al. Unsupervised context learning for speech recognition, 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[17] Tara N. Sainath, et al. Multilingual Speech Recognition with a Single End-to-End Model, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18] Arun Narayanan, et al. Toward Domain-Invariant Speech Recognition via Large Scale Training, 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[19] Tara N. Sainath, et al. Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model, 2019, INTERSPEECH.

[20] Andrew Y. Ng, et al. Improving Word Representations via Global Context and Multiple Word Prototypes, 2012, ACL.

[21] Brian Roark, et al. Bringing contextual information to Google speech recognition, 2015, INTERSPEECH.

[22] Tara N. Sainath, et al. Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling, 2019, ArXiv.

[23] Quoc V. Le, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24] Tara N. Sainath, et al. Shallow-Fusion End-to-End Contextual Biasing, 2019, INTERSPEECH.

[25] Tara N. Sainath, et al. Deep Context: End-to-end Contextual Speech Recognition, 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[26] Tara N. Sainath, et al. Contextual Speech Recognition in End-to-end Neural Network Systems Using Beam Search, 2018, INTERSPEECH.