Sparse Deep Learning: A New Framework Immune to Local Traps and Miscalibration

Deep learning has powered recent successes of artificial intelligence (AI). However, the deep neural network, as the basic model of deep learning, suffers from issues such as local traps and miscalibration. In this paper, we provide a new framework for sparse deep learning that addresses these issues in a coherent way. In particular, we lay down a theoretical foundation for sparse deep learning and propose prior annealing algorithms for learning sparse neural networks. The former brings the sparse deep neural network into the framework of statistical modeling, enabling prediction uncertainty to be correctly quantified. The latter is asymptotically guaranteed to converge to the global optimum, ensuring the validity of downstream statistical inference. Numerical results indicate the superiority of the proposed method over existing ones.
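As a rough illustration of how a prior annealing scheme can be paired with stochastic gradient MCMC, the sketch below trains a small network with stochastic gradient Langevin dynamics (SGLD) under a two-component mixture-Gaussian prior whose slab weight is annealed over training, and then truncates small weights to obtain a sparse network. The network, data, annealing schedule, step size, and truncation threshold are all placeholder assumptions for illustration, not the paper's exact algorithm.

import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data standing in for a real dataset.
n, p = 500, 20
X = torch.randn(n, p)
y = torch.sin(X[:, :1]) + 0.1 * torch.randn(n, 1)

net = nn.Sequential(nn.Linear(p, 32), nn.Tanh(), nn.Linear(32, 1))

def log_mixture_prior(w, lam, sigma0=1e-3, sigma1=1.0):
    # Log density of the mixture prior lam*N(0, sigma1^2) + (1-lam)*N(0, sigma0^2),
    # evaluated elementwise and summed over the weight tensor.
    def log_normal(x, s):
        return -0.5 * (x / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
    return torch.logaddexp(
        math.log(lam) + log_normal(w, sigma1),
        math.log(1.0 - lam) + log_normal(w, sigma0),
    ).sum()

lr, epochs = 1e-4, 200
for epoch in range(epochs):
    # Anneal the slab weight lam toward a small value so that, as training proceeds,
    # the spike component increasingly pulls irrelevant weights toward zero.
    lam = max(0.05, 1.0 - (epoch + 1) / epochs)

    net.zero_grad()
    nll = nn.functional.mse_loss(net(X), y, reduction="sum")
    log_prior = sum(log_mixture_prior(w, lam) for w in net.parameters())
    (nll - log_prior).backward()

    with torch.no_grad():
        for w in net.parameters():
            # SGLD update: a gradient step on the negative log posterior plus
            # Gaussian noise scaled by the step size.
            w -= lr * w.grad
            w += math.sqrt(2.0 * lr) * torch.randn_like(w)

# Truncate small weights to obtain an explicitly sparse network (threshold is a placeholder).
with torch.no_grad():
    for w in net.parameters():
        w[w.abs() < 1e-2] = 0.0

The proposed algorithm specifies its own annealing schedules for the prior and runs the sampler on mini-batches; the sketch is only meant to convey the overall structure: anneal the prior, sample with noisy gradients, then sparsify by truncation.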
