Grow and Prune Compact, Fast, and Accurate LSTMs

Long short-term memory (LSTM) has been widely used for sequential data modeling. Researchers have increased LSTM depth by stacking LSTM cells to improve performance. However, this introduces model redundancy, increases run-time latency, and makes the LSTMs more prone to overfitting. To address these problems, we propose a hidden-layer LSTM (H-LSTM) that adds hidden layers to the LSTM's original one-level nonlinear control gates. An H-LSTM increases accuracy while employing fewer external stacked layers, thus significantly reducing both the number of parameters and run-time latency. We employ grow-and-prune (GP) training to iteratively adjust the hidden layers through gradient-based growth and magnitude-based pruning of connections; GP training thus learns both the weights and the compact architecture of the H-LSTM control gates. We have GP-trained H-LSTMs for image captioning, speech recognition, and neural machine translation applications. For the NeuralTalk architecture on the MSCOCO dataset, our three models, respectively, reduce the number of parameters by 38.7× [floating-point operations (FLOPs) by 45.5×], reduce run-time latency by 4.5×, and improve the CIDEr-D score by 2.8 percent. For the DeepSpeech2 architecture on the AN4 dataset, the first model we generated reduces the number of parameters by 19.4× and run-time latency by 37.4 percent, while the second reduces the word error rate (WER) from 12.9 percent to 8.7 percent. For the encoder-decoder sequence-to-sequence network on the IWSLT 2014 German-English dataset, the first model we generated reduces the number of parameters by 10.8× and run-time latency by 14.2 percent, while the second increases the BLEU score from 30.02 to 30.98. Thus, GP-trained H-LSTMs are compact, fast, and accurate.
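
To make the gate-level change concrete, the following is a minimal PyTorch-style sketch of an H-LSTM cell in which each control gate is computed by a small multi-layer network rather than a single linear transformation. The single extra hidden layer per gate, the ReLU activation, and the layer sizes are illustrative assumptions, not the paper's exact configuration.

    # Minimal sketch of an H-LSTM cell: each control gate contains a hidden layer
    # (here one Linear-ReLU-Linear stack) instead of a single linear map.
    # Sizes and activations are illustrative assumptions.
    import torch
    import torch.nn as nn

    class HLSTMCell(nn.Module):
        def __init__(self, input_size, hidden_size, gate_hidden_size):
            super().__init__()
            def gate_mlp():
                # A plain LSTM would use a single
                # nn.Linear(input_size + hidden_size, hidden_size) here.
                return nn.Sequential(
                    nn.Linear(input_size + hidden_size, gate_hidden_size),
                    nn.ReLU(),
                    nn.Linear(gate_hidden_size, hidden_size),
                )
            self.input_gate = gate_mlp()
            self.forget_gate = gate_mlp()
            self.output_gate = gate_mlp()
            self.cell_update = gate_mlp()

        def forward(self, x, state):
            h, c = state
            z = torch.cat([x, h], dim=-1)
            i = torch.sigmoid(self.input_gate(z))     # input gate
            f = torch.sigmoid(self.forget_gate(z))    # forget gate
            o = torch.sigmoid(self.output_gate(z))    # output gate
            g = torch.tanh(self.cell_update(z))       # candidate cell state
            c_next = f * c + i * g
            h_next = o * torch.tanh(c_next)
            return h_next, c_next

    # Example step: batch of 8, 64-dimensional inputs, 128-dimensional hidden state.
    cell = HLSTMCell(input_size=64, hidden_size=128, gate_hidden_size=128)
    x = torch.randn(8, 64)
    h = torch.zeros(8, 128)
    c = torch.zeros(8, 128)
    h, c = cell(x, (h, c))

Because each gate is itself a small deep network, accuracy can be gained inside the cell rather than by stacking additional external LSTM layers.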
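
The GP training step can likewise be sketched with a binary mask over a gate's weight matrix: gradient-based growth activates dormant connections with the largest gradient magnitudes, and magnitude-based pruning removes active connections with the smallest weights. The growth and pruning fractions, masking scheme, and function names below are illustrative assumptions, not the paper's exact procedure or hyperparameters.

    # Illustrative sketch of one grow-and-prune (GP) step on a masked weight matrix.
    import torch

    def grow_connections(mask, grad, grow_frac=0.05):
        """Activate the dormant connections with the largest gradient magnitudes."""
        dormant = (mask == 0)
        k = int(grow_frac * dormant.sum().item())
        if k > 0:
            scores = grad.abs() * dormant.to(grad.dtype)  # ignore already-active weights
            idx = torch.topk(scores.view(-1), k).indices
            mask.view(-1)[idx] = 1.0
        return mask

    def prune_connections(mask, weight, prune_frac=0.05):
        """Deactivate the active connections with the smallest weight magnitudes."""
        active = (mask == 1)
        k = int(prune_frac * active.sum().item())
        if k > 0:
            scores = weight.abs().view(-1)
            scores[~active.view(-1)] = float('inf')  # dormant weights are never "pruned"
            idx = torch.topk(scores, k, largest=False).indices
            mask.view(-1)[idx] = 0.0
        return mask

    # Sketch of use inside a training loop (weights stay masked throughout):
    #   loss.backward()
    #   mask = grow_connections(mask, layer.weight.grad)
    #   ...continue training...
    #   mask = prune_connections(mask, layer.weight.data)
    #   layer.weight.data.mul_(mask)   # zero out pruned weights

Iterating growth, training, and pruning in this manner yields the compact gate architectures reported above.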
