Highway-LSTM and Recurrent Highway Networks for Speech Recognition

Recently, very deep networks, with as many as hundreds of layers, have shown great success in image classification tasks. One key component enabling such depth is the use of “skip connections”, either residual or highway, which alleviate the vanishing and exploding gradient problems. While skip connections have been explored for speech, prior work has focused mainly on feed-forward networks. Since recurrent structures, such as LSTMs, have produced state-of-the-art results on many of our Voice Search tasks, the goal of this work is to thoroughly investigate different approaches to adding depth to recurrent structures. Specifically, we experiment with novel Highway-LSTM models with bottleneck skip connections and show that a 10-layer model can outperform a state-of-the-art 5-layer LSTM model with the same number of parameters by 2% relative WER. In addition, we experiment with Recurrent Highway layers and find them to be on par with Highway-LSTM models when given sufficient depth.
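For reference, the highway connection described by Srivastava et al. (2015) mixes a layer's transformed output with its untransformed input through a learned gate, giving gradients an identity path through the stack. Below is a minimal NumPy sketch of that generic gating; the function and parameter names are ours, and it does not reproduce the paper's exact bottleneck placement inside the LSTM stack.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_connection(x, W_h, b_h, W_t, b_t):
    """Generic highway connection (Srivastava et al., 2015).

    A transform gate t in (0, 1) mixes the transformed output H(x)
    with the untransformed input x, so gradients can flow through
    the identity path when t is near 0.
    """
    h = np.tanh(W_h @ x + b_h)     # candidate transform H(x)
    t = sigmoid(W_t @ x + b_t)     # transform gate T(x)
    return t * h + (1.0 - t) * x   # y = T(x) * H(x) + (1 - T(x)) * x

# Example: one highway step on a 4-dimensional input.
rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
W_h, W_t = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b_h = np.zeros(d)
b_t = -1.0 * np.ones(d)  # negative gate bias initially favors the carry path
print(highway_connection(x, W_h, b_h, W_t, b_t))
```

Initializing the transform-gate bias to a negative value biases the connection toward carrying the input through unchanged early in training, a common choice in the highway-network literature.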
