A Bayesian Optimization Framework for Neural Network Compression

Neural network compression is an important step for deploying neural networks on devices with limited memory or where inference speed is critical. Compression parameters must be tuned to achieve the desired trade-off between model size and performance. This is often done by optimizing the loss on a validation set, which should be large enough to approximate the true risk and thereby ensure good generalization; evaluating the full validation set at every step, however, can be computationally expensive. In this work, we develop a general Bayesian optimization framework for optimizing functions that are computed as U-statistics. We propagate the Gaussian uncertainty of these statistics through the Bayesian optimization framework, yielding a method that provides a probabilistic approximation certificate for the result. We then apply this framework to parameter selection in neural network compression, where objectives that can be written as U-statistics typically take the form of the empirical risk or a knowledge-distillation loss for deep discriminative models. We demonstrate our method on VGG and ResNet models; the resulting system finds optimal compression parameters for relatively high-dimensional parametrizations in a matter of minutes on a standard desktop machine, orders of magnitude faster than competing methods.
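To make the idea concrete, the following is a minimal sketch (not the authors' released implementation) of Bayesian optimization over compression parameters when the objective is estimated from a validation subsample. The names objective, bounds, and subsampled_risk are illustrative assumptions: objective(theta) is expected to compress the model with parameters theta, evaluate per-example losses on a random validation subsample, and return the mean loss together with the variance of that mean; that variance is passed to the Gaussian process as per-observation noise. The paper's probabilistic approximation certificate and stopping rule are not shown here.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def subsampled_risk(per_example_losses):
    """Mean loss over a subsample (a first-order U-statistic) and the variance
    of that mean, which is treated as Gaussian observation noise."""
    losses = np.asarray(per_example_losses, dtype=float)
    return losses.mean(), losses.var(ddof=1) / len(losses)

def expected_improvement(gp, X_cand, y_best):
    """Standard expected-improvement acquisition for minimization."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    z = (y_best - mu) / np.maximum(sigma, 1e-12)
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_minimize(objective, bounds, n_init=5, n_iters=30, n_cand=2048, seed=0):
    """objective(theta) -> (estimated loss, variance of the estimate)."""
    rng = np.random.default_rng(seed)
    dim = bounds.shape[0]
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, dim))
    means, variances = map(np.array, zip(*(objective(t) for t in X)))
    for _ in range(n_iters):
        # Heteroscedastic GP: the estimated variance of each U-statistic is
        # added to the diagonal of the kernel matrix via `alpha`.
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=variances,
                                      normalize_y=True).fit(X, means)
        X_cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_cand, dim))
        theta = X_cand[np.argmax(expected_improvement(gp, X_cand, means.min()))]
        m, v = objective(theta)
        X = np.vstack([X, theta])
        means = np.append(means, m)
        variances = np.append(variances, v)
    return X[np.argmin(means)]  # best compression parameters observed

Because each observation is only a subsample estimate, the per-point noise level is what allows the Gaussian process to discount cheap, noisy evaluations rather than treating them as exact function values; this is the role the propagated Gaussian uncertainties play in the framework described above.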
