Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination

While post-training model compression can greatly reduce the inference cost of a deep neural network, uncompressed training still consumes a huge amount of hardware resources, run-time and energy. It is highly desirable to directly train a compact neural network from scratch with low memory and low computational cost. Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-size neural networks. However, directly training a low-rank tensorized neural network is a very challenging task because it is hard to determine a proper tensor rank a priori, which controls the model complexity and compression ratio in the training process. This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks. We first develop a flexible Bayesian model that can handle various low-rank tensor formats (e.g., CP, Tucker, tensor-train and tensor-train matrix) that compress neural network parameters in training. This model can automatically determine the tensor ranks inside a nonlinear forward model, which is beyond the capability of existing Bayesian tensor methods. We further develop a scalable stochastic variational inference solver to estimate the posterior density of large-scale problems in training. Our work provides the first general-purpose rank-adaptive framework for end-to-end tensorized training. Our numerical results on various neural network architectures show orders-of-magnitude parameter reduction and little accuracy loss (or even better accuracy) in the training process. Specifically, on a very large deep learning recommendation system with over 4.2 × 10^9 model parameters, our method can reduce the trainable variables to only 1.6 × 10^5 automatically during training (i.e., a 2.6 × 10^4× reduction) while achieving almost the same accuracy.
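To make the parameter savings of tensorized training concrete, below is a minimal sketch (assuming PyTorch) of a linear layer whose weight matrix is stored in the tensor-train-matrix format with a hand-picked rank. The class name `TTLinear`, the chosen factorizations and the fixed rank are illustrative assumptions, not the paper's implementation; the framework described above would instead infer the ranks automatically through its Bayesian model during training.

```python
# Hypothetical sketch: a tensor-train-matrix (TT-matrix) linear layer with a
# fixed, hand-picked rank. The paper's framework learns the ranks automatically.
import math
import torch
import torch.nn as nn


class TTLinear(nn.Module):
    """Linear layer whose weight is stored as TT-matrix cores instead of a dense matrix."""

    def __init__(self, in_factors, out_factors, rank):
        super().__init__()
        assert len(in_factors) == len(out_factors)
        d = len(in_factors)
        ranks = [1] + [rank] * (d - 1) + [1]
        # Core k has shape (r_{k-1}, m_k, n_k, r_k), where m_k factorize the
        # output dimension and n_k factorize the input dimension.
        self.cores = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(ranks[k], out_factors[k], in_factors[k], ranks[k + 1]))
             for k in range(d)]
        )
        self.in_features = math.prod(in_factors)
        self.out_features = math.prod(out_factors)
        self.bias = nn.Parameter(torch.zeros(self.out_features))

    def full_weight(self):
        # Contract the cores into a dense (out_features, in_features) matrix.
        # A practical implementation contracts the input with the cores directly
        # and never materializes this matrix.
        w = self.cores[0]
        for core in self.cores[1:]:
            w = torch.einsum("amnb,bpqc->ampnqc", w, core)
            a, m, p, n, q, c = w.shape
            w = w.reshape(a, m * p, n * q, c)
        return w.squeeze(0).squeeze(-1)

    def forward(self, x):
        return x @ self.full_weight().t() + self.bias


layer = TTLinear(in_factors=[8, 8, 8], out_factors=[4, 8, 8], rank=4)
dense_params = layer.in_features * layer.out_features     # 512 * 256 = 131072
tt_params = sum(core.numel() for core in layer.cores)     # 128 + 1024 + 256 = 1408
print(dense_params, tt_params)
y = layer(torch.randn(16, layer.in_features))
print(y.shape)  # torch.Size([16, 256])
```

Even at this small scale and with a fixed rank of 4, the TT-matrix cores hold roughly 90× fewer parameters than the equivalent dense weight; the rank-adaptive Bayesian training described in the abstract would additionally shrink or prune rank components that the data do not support.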
