High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

While the community continues to promote end-to-end models over conventional hybrid models, which are usually long short-term memory (LSTM) models trained with a cross-entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be improved significantly. In this paper, we detail our recent efforts to improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition. To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples temporal modeling from target classification and incorporates future context frames to obtain more information for accurate acoustic modeling. We further improve the training strategy with sequence-level teacher-student learning. To obtain low latency, we design a two-head cltLSTM, in which one head has zero latency and the other head has only a small latency compared to an LSTM. When trained on Microsoft's 65 thousand hours of anonymized training data and evaluated on test sets totaling 1.8 million words, the proposed two-head cltLSTM with the proposed training strategy yields a 28.2% relative word error rate (WER) reduction over the conventional LSTM acoustic model, with similar perceived latency.
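
To make the two-head design concrete, the following is a minimal sketch, not the authors' implementation: per-layer time-LSTMs perform temporal modeling only, a depth-LSTM scans the layer outputs at each frame for classification, one output head uses only current-frame information (zero look-ahead latency), and a second contextual head additionally consumes a few future frames. All module names, dimensions, and the look-ahead scheme here are illustrative assumptions.

```python
# Hedged sketch of a two-head contextual layer trajectory LSTM (cltLSTM).
# This is a simplified illustration, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadCLTLSTMSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, depth_hidden=256,
                 num_layers=4, num_senones=9000, lookahead=4):
        super().__init__()
        self.lookahead = lookahead
        # Time-LSTMs: one per layer, responsible only for temporal modeling.
        self.time_lstms = nn.ModuleList(
            nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(num_layers)
        )
        # Depth-LSTM: scans across layers at each frame for target classification.
        self.depth_lstm = nn.LSTM(hidden, depth_hidden, batch_first=True)
        # Head 1: zero look-ahead head (streaming, low latency).
        self.head_zero = nn.Linear(depth_hidden, num_senones)
        # Head 2: contextual head that also sees `lookahead` future frames.
        self.context_proj = nn.Linear(hidden * lookahead, depth_hidden)
        self.head_context = nn.Linear(2 * depth_hidden, num_senones)

    def forward(self, x):
        # x: (batch, time, feat_dim)
        layer_outs = []
        h = x
        for lstm in self.time_lstms:
            h, _ = lstm(h)                      # temporal modeling only
            layer_outs.append(h)
        # (batch, time, num_layers, hidden): layer trajectory at every frame.
        stack = torch.stack(layer_outs, dim=2)
        B, T, L, H = stack.shape
        # Depth-LSTM runs over the layer axis independently at each frame.
        depth_out, _ = self.depth_lstm(stack.reshape(B * T, L, H))
        g = depth_out[:, -1, :].reshape(B, T, -1)   # per-frame summary
        # Zero-latency head: current-frame information only.
        logits_zero = self.head_zero(g)
        # Contextual head: gather `lookahead` future frames of the top
        # time-LSTM output (zero-padded at the sequence end), then classify.
        top = layer_outs[-1]
        padded = F.pad(top, (0, 0, 0, self.lookahead))      # pad time axis
        future = torch.cat(
            [padded[:, k:k + T, :] for k in range(1, self.lookahead + 1)], dim=-1)
        ctx = torch.tanh(self.context_proj(future))
        logits_ctx = self.head_context(torch.cat([g, ctx], dim=-1))
        return logits_zero, logits_ctx


if __name__ == "__main__":
    model = TwoHeadCLTLSTMSketch()
    feats = torch.randn(2, 50, 80)
    z, c = model(feats)
    print(z.shape, c.shape)   # both (2, 50, 9000)
```

In this sketch the zero-latency head can drive streaming decoding frame by frame, while the contextual head emits refined posteriors a few frames later, which is the intuition behind keeping perceived latency close to that of a standard LSTM.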
