Interpreting a Penalty as the Influence of a Bayesian Prior

In machine learning, it is common to optimize the parameters of a probabilistic model under a somewhat ad hoc regularization term that penalizes certain parameter values. Regularization terms arise naturally in Variational Inference (VI), a tractable way to approximate Bayesian posteriors: the loss to optimize contains a Kullback--Leibler divergence term between the approximate posterior and a Bayesian prior. We fully characterize which regularizers can arise this way, and provide a systematic way to compute the corresponding prior. This viewpoint also yields a prediction for useful values of the regularization factor in neural networks. We apply this framework to regularizers such as the L1 and group-Lasso penalties.
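To make the correspondence concrete, here is a minimal sketch of the standard setting (notation chosen for illustration, not taken from the paper). Penalized maximum likelihood minimizes
$$ \ell(\theta) \;=\; -\log p(\mathcal{D}\mid\theta) \;+\; \lambda\, r(\theta), $$
which, for maximum-a-posteriori estimation, amounts to using a prior $\alpha(\theta) \propto \exp\!\left(-\lambda\, r(\theta)\right)$; for instance, an L1 penalty $r(\theta)=\lVert\theta\rVert_1$ corresponds to a Laplace prior. Variational inference instead minimizes, over a family of approximate posteriors $q$,
$$ \mathcal{L}(q) \;=\; \mathbb{E}_{\theta\sim q}\!\left[-\log p(\mathcal{D}\mid\theta)\right] \;+\; \mathrm{KL}\!\left(q \,\big\|\, \alpha\right), $$
where the KL term with respect to the prior $\alpha$ plays the role of the regularizer. The question addressed here is which penalties can be written as such a KL term, and for which prior $\alpha$.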
