Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination

While post-training model compression can greatly reduce the inference cost of a deep neural network, uncompressed training still consumes a huge amount of hardware resources, run-time and energy. It is highly desirable to directly train a compact neural network from scratch with low memory and low computational cost. Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-size neural networks. However, directly training a low-rank tensorized neural network is a very challenging task because it is hard to determine a proper tensor rank a priori, which controls the model complexity and compression ratio in the training process. This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks. We first develop a flexible Bayesian model that can handle various low-rank tensor formats (e.g., CP, Tucker, tensor-train and tensor-train matrix) that compress neural network parameters in training. This model can automatically determine the tensor ranks inside a nonlinear forward model, which is beyond the capability of existing Bayesian tensor methods. We further develop a scalable stochastic variational inference solver to estimate the posterior density of large-scale problems in training. Our work provides the first general-purpose rank-adaptive framework for end-to-end tensorized training. Our numerical results on various neural network architectures show orders-of-magnitude parameter reduction and little accuracy loss (or even better accuracy) in the training process. Specifically, on a very large deep learning recommendation system with over 4.2 × 10^9 model parameters, our method can reduce the trainable variables to only 1.6 × 10^5 automatically during training (i.e., a 2.6 × 10^4× reduction) while achieving almost the same accuracy.
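To make the parameter savings of tensorized training concrete, below is a minimal sketch (assuming PyTorch) of a linear layer whose weight matrix is stored in the tensor-train-matrix format with a hand-picked rank. The class name `TTLinear`, the chosen factorizations and the fixed rank are illustrative assumptions, not the paper's implementation; the framework described above would instead infer the ranks automatically through its Bayesian model during training.

```python
# Hypothetical sketch: a tensor-train-matrix (TT-matrix) linear layer with a
# fixed, hand-picked rank. The paper's framework learns the ranks automatically.
import math
import torch
import torch.nn as nn


class TTLinear(nn.Module):
    """Linear layer whose weight is stored as TT-matrix cores instead of a dense matrix."""

    def __init__(self, in_factors, out_factors, rank):
        super().__init__()
        assert len(in_factors) == len(out_factors)
        d = len(in_factors)
        ranks = [1] + [rank] * (d - 1) + [1]
        # Core k has shape (r_{k-1}, m_k, n_k, r_k), where m_k factorize the
        # output dimension and n_k factorize the input dimension.
        self.cores = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(ranks[k], out_factors[k], in_factors[k], ranks[k + 1]))
             for k in range(d)]
        )
        self.in_features = math.prod(in_factors)
        self.out_features = math.prod(out_factors)
        self.bias = nn.Parameter(torch.zeros(self.out_features))

    def full_weight(self):
        # Contract the cores into a dense (out_features, in_features) matrix.
        # A practical implementation contracts the input with the cores directly
        # and never materializes this matrix.
        w = self.cores[0]
        for core in self.cores[1:]:
            w = torch.einsum("amnb,bpqc->ampnqc", w, core)
            a, m, p, n, q, c = w.shape
            w = w.reshape(a, m * p, n * q, c)
        return w.squeeze(0).squeeze(-1)

    def forward(self, x):
        return x @ self.full_weight().t() + self.bias


layer = TTLinear(in_factors=[8, 8, 8], out_factors=[4, 8, 8], rank=4)
dense_params = layer.in_features * layer.out_features     # 512 * 256 = 131072
tt_params = sum(core.numel() for core in layer.cores)     # 128 + 1024 + 256 = 1408
print(dense_params, tt_params)
y = layer(torch.randn(16, layer.in_features))
print(y.shape)  # torch.Size([16, 256])
```

Even at this small scale and with a fixed rank of 4, the TT-matrix cores hold roughly 90× fewer parameters than the equivalent dense weight; the rank-adaptive Bayesian training described in the abstract would additionally shrink or prune rank components that the data do not support.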
