A Comparison of End-to-End Models for Long-Form Speech Recognition

End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems [1], [2]. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical on long utterances that last from minutes to hours remains an open question. In this paper, we both investigate and improve the performance of end-to-end models on long-form transcription. We first present an empirical comparison of different end-to-end models on a real world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based systems that significantly improve its performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long utterances into shorter overlapping segments. Combining these two improvements, we show that attention-based end-to-end models can be very competitive to RNN-T on long-form speech recognition.

[1]  Tara N. Sainath,et al.  State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Hagen Soltau,et al.  Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition , 2016, INTERSPEECH.

[3]  Tara N. Sainath,et al.  Streaming End-to-end Speech Recognition for Mobile Devices , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Tara N. Sainath,et al.  Improving the efficiency of forward-backward algorithm using batched computation in TensorFlow , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[5]  Tara N. Sainath,et al.  Compression of End-to-End Models , 2018, INTERSPEECH.

[6]  Arun Narayanan,et al.  Toward Domain-Invariant Speech Recognition via Large Scale Training , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[7]  Khe Chai Sim,et al.  Efficient Implementation of Recurrent Neural Network Transducer in Tensorflow , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[8]  Wei Li,et al.  Monotonic Infinite Lookback Attention for Simultaneous Machine Translation , 2019, ACL.

[9]  Yoshua Bengio,et al.  End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[11]  Alex Graves,et al.  Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[12]  Hairong Liu,et al.  Exploring neural transducers for end-to-end speech recognition , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[13]  Quoc V. Le,et al.  Listen, Attend and Spell , 2015, ArXiv.

[14]  Hank Liao,et al.  Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[15]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[16]  Satoshi Nakamura,et al.  Local Monotonic Attention Mechanism for End-to-End Speech And Language Processing , 2017, IJCNLP.

[17]  Colin Raffel,et al.  Online and Linear-Time Attention by Enforcing Monotonic Alignments , 2017, ICML.

[18]  Tara N. Sainath,et al.  Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling , 2019, ArXiv.

[19]  Colin Raffel,et al.  Monotonic Chunkwise Attention , 2017, ICLR.

[20]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[21]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[22]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[23]  Yu Zhang,et al.  Very deep convolutional networks for end-to-end speech recognition , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).