Grow and Prune Compact, Fast, and Accurate LSTMs

Long short-term memory (LSTM) has been widely used for sequential data modeling. Researchers have increased LSTM depth by stacking LSTM cells to improve performance. However, this introduces model redundancy, increases run-time latency, and makes the LSTMs more prone to overfitting. To address these problems, we propose a hidden-layer LSTM (H-LSTM) that adds hidden layers to the LSTM's original one-level nonlinear control gates. An H-LSTM increases accuracy while employing fewer external stacked layers, thus significantly reducing both the number of parameters and run-time latency. We employ grow-and-prune (GP) training to iteratively adjust the hidden layers through gradient-based growth and magnitude-based pruning of connections; GP training thus learns both the weights and the compact architecture of the H-LSTM control gates. We have GP-trained H-LSTMs for image captioning, speech recognition, and neural machine translation applications. For the NeuralTalk architecture on the MSCOCO dataset, our three models, respectively, reduce the number of parameters by 38.7× [floating-point operations (FLOPs) by 45.5×], reduce run-time latency by 4.5×, and improve the CIDEr-D score by 2.8 percent. For the DeepSpeech2 architecture on the AN4 dataset, the first model we generated reduces the number of parameters by 19.4× and run-time latency by 37.4 percent, while the second reduces the word error rate (WER) from 12.9 percent to 8.7 percent. For the encoder-decoder sequence-to-sequence network on the IWSLT 2014 German-English dataset, the first model we generated reduces the number of parameters by 10.8× and run-time latency by 14.2 percent, while the second increases the BLEU score from 30.02 to 30.98. Thus, GP-trained H-LSTMs are compact, fast, and accurate.
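
To make the gate-level change concrete, the following is a minimal PyTorch-style sketch of an H-LSTM cell in which each control gate is computed by a small multi-layer network rather than a single linear transformation. The single extra hidden layer per gate, the ReLU activation, and the layer sizes are illustrative assumptions, not the paper's exact configuration.

    # Minimal sketch of an H-LSTM cell: each control gate contains a hidden layer
    # (here one Linear-ReLU-Linear stack) instead of a single linear map.
    # Sizes and activations are illustrative assumptions.
    import torch
    import torch.nn as nn

    class HLSTMCell(nn.Module):
        def __init__(self, input_size, hidden_size, gate_hidden_size):
            super().__init__()
            def gate_mlp():
                # A plain LSTM would use a single
                # nn.Linear(input_size + hidden_size, hidden_size) here.
                return nn.Sequential(
                    nn.Linear(input_size + hidden_size, gate_hidden_size),
                    nn.ReLU(),
                    nn.Linear(gate_hidden_size, hidden_size),
                )
            self.input_gate = gate_mlp()
            self.forget_gate = gate_mlp()
            self.output_gate = gate_mlp()
            self.cell_update = gate_mlp()

        def forward(self, x, state):
            h, c = state
            z = torch.cat([x, h], dim=-1)
            i = torch.sigmoid(self.input_gate(z))     # input gate
            f = torch.sigmoid(self.forget_gate(z))    # forget gate
            o = torch.sigmoid(self.output_gate(z))    # output gate
            g = torch.tanh(self.cell_update(z))       # candidate cell state
            c_next = f * c + i * g
            h_next = o * torch.tanh(c_next)
            return h_next, c_next

    # Example step: batch of 8, 64-dimensional inputs, 128-dimensional hidden state.
    cell = HLSTMCell(input_size=64, hidden_size=128, gate_hidden_size=128)
    x = torch.randn(8, 64)
    h = torch.zeros(8, 128)
    c = torch.zeros(8, 128)
    h, c = cell(x, (h, c))

Because each gate is itself a small deep network, accuracy can be gained inside the cell rather than by stacking additional external LSTM layers.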
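
The GP training step can likewise be sketched with a binary mask over a gate's weight matrix: gradient-based growth activates dormant connections with the largest gradient magnitudes, and magnitude-based pruning removes active connections with the smallest weights. The growth and pruning fractions, masking scheme, and function names below are illustrative assumptions, not the paper's exact procedure or hyperparameters.

    # Illustrative sketch of one grow-and-prune (GP) step on a masked weight matrix.
    import torch

    def grow_connections(mask, grad, grow_frac=0.05):
        """Activate the dormant connections with the largest gradient magnitudes."""
        dormant = (mask == 0)
        k = int(grow_frac * dormant.sum().item())
        if k > 0:
            scores = grad.abs() * dormant.to(grad.dtype)  # ignore already-active weights
            idx = torch.topk(scores.view(-1), k).indices
            mask.view(-1)[idx] = 1.0
        return mask

    def prune_connections(mask, weight, prune_frac=0.05):
        """Deactivate the active connections with the smallest weight magnitudes."""
        active = (mask == 1)
        k = int(prune_frac * active.sum().item())
        if k > 0:
            scores = weight.abs().view(-1)
            scores[~active.view(-1)] = float('inf')  # dormant weights are never "pruned"
            idx = torch.topk(scores, k, largest=False).indices
            mask.view(-1)[idx] = 0.0
        return mask

    # Sketch of use inside a training loop (weights stay masked throughout):
    #   loss.backward()
    #   mask = grow_connections(mask, layer.weight.grad)
    #   ...continue training...
    #   mask = prune_connections(mask, layer.weight.data)
    #   layer.weight.data.mul_(mask)   # zero out pruned weights

Iterating growth, training, and pruning in this manner yields the compact gate architectures reported above.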
