Reducing the Computational Complexity of Two-Dimensional LSTMs

Long Short-Term Memory Recurrent Neural Networks (LSTMs) are effective at modeling temporal variations in speech recognition tasks, and have become an integral component of many state-of-the-art ASR systems. More recently, LSTMs have been extended to model variations in the speech signal in two dimensions, namely time and frequency [1, 2]. However, one problem with two-dimensional LSTMs, such as Grid-LSTMs, is that processing in both time and frequency occurs sequentially, which increases computational complexity. In this work, we look at minimizing the Grid-LSTM's dependence on previous time and frequency points in the sequence, thereby reducing computational complexity. Specifically, we compare three approaches to reducing computation: a bidirectional Grid-LSTM (biGrid-LSTM) with non-overlapping frequency sub-band processing, a PyraMiD-LSTM [3], and a frequency-block Grid-LSTM (fbGrid-LSTM) for parallel time-frequency processing. We find that the fbGrid-LSTM can reduce computation costs by a factor of four with no loss in accuracy on a 12,500-hour Voice Search task.
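As a rough illustration of why non-overlapping frequency sub-bands shorten the sequential chain along the frequency axis, the sketch below counts how many sub-band positions a sliding window visits for a given stride. It assumes the frequency axis is processed as a sliding window of fixed width, which the abstract does not spell out. The function name `num_frequency_steps` and the values for the number of frequency bins, window width, and stride are hypothetical and are not taken from the paper. The sketch is not meant to reproduce the reported factor-of-four saving.

```python
def num_frequency_steps(num_bins: int, filter_size: int, stride: int) -> int:
    """Count the sub-band positions visited by a window sliding over frequency.

    Each position corresponds to one sequential step along the frequency axis.
    """
    if filter_size <= 0 or stride <= 0:
        raise ValueError("filter_size and stride must be positive")
    if filter_size > num_bins:
        raise ValueError("filter_size cannot exceed num_bins")
    return (num_bins - filter_size) // stride + 1


# Hypothetical settings, chosen only for illustration (not from the paper).
NUM_BINS = 128      # assumed number of frequency bins per frame
FILTER_SIZE = 16    # assumed sub-band width

# Overlapping sub-bands: the window advances by less than its width.
overlapping_steps = num_frequency_steps(NUM_BINS, FILTER_SIZE, stride=2)

# Non-overlapping sub-bands: the window advances by its full width.
non_overlapping_steps = num_frequency_steps(NUM_BINS, FILTER_SIZE, stride=FILTER_SIZE)

print(f"overlapping sub-bands:     {overlapping_steps} frequency steps per frame")
print(f"non-overlapping sub-bands: {non_overlapping_steps} frequency steps per frame")
```

Under these assumed settings the non-overlapping configuration visits far fewer sub-band positions per frame, which is the kind of reduction in sequential dependence the abstract describes. The actual savings depend on the model configuration used in the paper.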

[1] Jürgen Schmidhuber, et al. Multi-dimensional Recurrent Neural Networks, 2007, ICANN.

[2] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[3] Jürgen Schmidhuber, et al. Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation, 2015, NIPS.

[4] Tara N. Sainath, et al. Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks, 2015, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Tara N. Sainath, et al. Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks, 2016, INTERSPEECH.

[6] Geoffrey Zweig, et al. Exploring multidimensional LSTMs for large vocabulary ASR, 2016, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Jont B. Allen. How do humans process and recognize speech, 1993.

[8] Alex Graves, et al. Grid Long Short-Term Memory, 2015, ICLR.

[9] Tara N. Sainath, et al. Lower Frame Rate Neural Network Acoustic Models, 2016, INTERSPEECH.

[10] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[11] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.

[12] Yoshua Bengio, et al. ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks, 2015, ArXiv.

[13] Andrew W. Senior, et al. Long short-term memory recurrent neural network architectures for large scale acoustic modeling, 2014, INTERSPEECH.