Learning Sparse Structured Ensembles with SG-MCMC and Network Pruning

An ensemble of neural networks is known to be more robust and accurate than an individual network, but usually at a linearly increased cost in both training and testing. In this work, we propose a two-stage method to learn Sparse Structured Ensembles (SSEs) of neural networks. In the first stage, we run SG-MCMC with group sparse priors to draw an ensemble of samples from the posterior distribution of the network parameters. In the second stage, we apply weight pruning to each sampled network and then retrain over the remaining connections. By learning SSEs with SG-MCMC and pruning in this way, we not only achieve high prediction accuracy, since SG-MCMC enhances exploration of the model-parameter space, but also significantly reduce memory and computation cost in both training and testing of NN ensembles. This is thoroughly evaluated in experiments on learning SSEs of both FNNs and LSTMs. For example, in LSTM-based language modeling (LM), we obtain a 21% relative reduction in LM perplexity by learning an SSE of 4 large LSTM models, which in total has only 30% of the model parameters and 70% of the computations of the baseline large LSTM LM. To the best of our knowledge, this work represents the first methodology and empirical study that integrates SG-MCMC, group sparse priors and network pruning for learning NN ensembles.
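To make the two-stage recipe concrete, the following is a minimal, illustrative PyTorch sketch, not the paper's exact implementation: sgld_step performs one SGLD update whose prior term is a group-lasso penalty over the output rows of each weight matrix, prune_and_retrain implements the second stage with simple magnitude-threshold pruning, and learn_sse collects the ensemble. The grouping choice, hyper-parameters, pruning threshold, and the helpers sample_fn/retrain_fn are assumptions made here for illustration.

```python
# Minimal sketch of the two-stage SSE procedure, assuming PyTorch.
# Model, hyper-parameters, grouping, and pruning threshold are illustrative
# assumptions, not the paper's exact configuration.
import copy
import torch

def sgld_step(model, loss, step_size, lambda_gs, dataset_size, batch_size):
    """One SGLD update; `loss` is the minibatch negative log-likelihood and the
    prior is a group lasso over the output rows of each 2-D weight matrix."""
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            # Rescale the minibatch likelihood gradient to the full data set.
            grad = (dataset_size / batch_size) * p.grad
            if p.dim() > 1:  # apply the group prior to weight matrices only
                group_norm = p.norm(dim=1, keepdim=True).clamp_min(1e-12)
                grad = grad + lambda_gs * p / group_norm  # grad of lambda * sum_g ||p_g||_2
            # Langevin dynamics: half-step on the gradient plus N(0, step_size) noise.
            noise = torch.randn_like(p) * step_size ** 0.5
            p.add_(-0.5 * step_size * grad + noise)

def prune_and_retrain(member, retrain_fn, threshold=1e-2):
    """Stage two: zero out small-magnitude weights, then retrain the survivors."""
    masks = {}
    with torch.no_grad():
        for name, p in member.named_parameters():
            masks[name] = (p.abs() > threshold).float()
            p.mul_(masks[name])
    retrain_fn(member, masks)  # retraining should re-apply the masks after each update
    return member

def learn_sse(model, sample_fn, retrain_fn, n_samples=4):
    """Collect SG-MCMC samples, then prune and retrain each into an SSE member."""
    ensemble = []
    for _ in range(n_samples):
        sample_fn(model)               # e.g. run sgld_step over the data for a while
        member = copy.deepcopy(model)  # freeze this posterior sample
        ensemble.append(prune_and_retrain(member, retrain_fn))
    return ensemble
```

At test time the ensemble prediction would simply average the members' output distributions; because each retained member is heavily pruned, the whole SSE can remain smaller than a single dense baseline model, as in the 30%-of-parameters example above.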
