Improving latency-controlled BLSTM acoustic models for online speech recognition

Bidirectional long short-term memory (BLSTM) recurrent neural networks are powerful acoustic models in terms of recognition accuracy. When BLSTM acoustic models are used in decoding, however, the speech decoder must wait until the end of a whole sentence before forward-propagation in the backward direction can be performed. This property makes BLSTM acoustic models unsuitable for real-time online speech recognition because of the latency it introduces. Recently, context-sensitive-chunk BLSTM and latency-controlled BLSTM acoustic models have been proposed; both chop a whole sentence into several overlapping chunks. By appending several left and/or right contextual frames to each chunk, forward-propagation of the BLSTM can be done within a controlled time delay, while recognition accuracy is maintained compared with conventional BLSTM models. In this paper, two improved versions of latency-controlled BLSTM acoustic models are presented. By using different types of neural network topology to initialize the BLSTM memory cell states, we aim to reduce the computational cost introduced by the contextual frames and to enable faster online recognition. Experimental results on a 320-hour Switchboard task show that the improved versions accelerate decoding by 24% to 61% without significant loss in recognition accuracy.
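To make the chunking idea concrete, the sketch below shows one way a feature sequence could be split into latency-controlled chunks with appended right-context frames, so that the backward pass only ever spans a bounded number of future frames. The function name, chunk size, and context length are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def make_lc_blstm_chunks(features, chunk_size=64, right_context=16):
    """Split a (T, D) feature sequence into overlapping chunks for
    latency-controlled BLSTM forward-propagation.

    Each chunk carries `chunk_size` "main" frames plus `right_context`
    appended future frames; only the main frames produce outputs, so the
    decoding latency is bounded by chunk_size + right_context frames.
    Parameter names and sizes are assumptions for illustration only.
    """
    T = features.shape[0]
    chunks = []
    for start in range(0, T, chunk_size):
        main_end = min(start + chunk_size, T)
        ctx_end = min(main_end + right_context, T)
        chunks.append({
            "input": features[start:ctx_end],   # main + right-context frames
            "num_main": main_end - start,       # frames whose outputs are kept
        })
    return chunks

# Example: a 10 s utterance at 100 frames/s with 40-dim features
feats = np.random.randn(1000, 40).astype(np.float32)
for chunk in make_lc_blstm_chunks(feats):
    # Forward-propagate the BLSTM over chunk["input"], then keep only the
    # outputs of the first chunk["num_main"] frames. The backward-direction
    # cell states are re-initialized at the start of every chunk (e.g., to
    # zero, or to values predicted by an auxiliary network as studied here).
    pass
```

The appended right-context frames add computation that is discarded after each chunk; replacing them with a learned initialization of the backward cell states is what allows the faster decoding reported in the abstract.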
