Good Initializations of Variational Bayes for Deep Models

Stochastic variational inference is an established way to carry out approximate Bayesian inference for deep models. While effective initialization schemes have been proposed for loss minimization in deep learning, far less attention has been paid to initializing stochastic variational inference. We address this by proposing a novel layer-wise initialization strategy based on Bayesian linear models. The proposed method is extensively validated on regression and classification tasks, including Bayesian DeepNets and ConvNets, and shows faster and better convergence than alternatives inspired by the literature on initializations for loss minimization.
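
As a rough illustration of the layer-wise idea, and not the paper's exact algorithm, the sketch below initializes a mean-field Gaussian posterior over each layer's weights from a closed-form Bayesian linear regression fit, proceeding one layer at a time while propagating the data through the layers already initialized. The prior/noise precisions (alpha, beta), the tanh activation, and the random-projection hidden targets are hypothetical choices made only for this example.

```python
import numpy as np

def bayesian_linear_posterior(Phi, y, alpha=1.0, beta=1.0):
    """Closed-form posterior of a Bayesian linear model.

    Prior:      w ~ N(0, alpha^{-1} I)
    Likelihood: y ~ N(Phi w, beta^{-1} I)
    Returns the posterior mean and covariance of w.
    """
    d = Phi.shape[1]
    S_inv = alpha * np.eye(d) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ y
    return m, S

def init_layer_variational_params(X, Y, alpha=1.0, beta=1.0):
    """Initialize a mean-field Gaussian q(W) for one layer.

    Each output unit is treated as an independent Bayesian linear
    regression of its target column in Y onto the layer inputs X;
    the posterior mean and marginal variances become the initial
    variational means and variances of that unit's weights.
    """
    means, variances = [], []
    for j in range(Y.shape[1]):
        m, S = bayesian_linear_posterior(X, Y[:, j], alpha, beta)
        means.append(m)
        variances.append(np.diag(S))
    return np.stack(means, axis=1), np.stack(variances, axis=1)

# Toy usage: a 2-layer regression MLP, initialized layer by layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))        # inputs
Y = rng.normal(size=(256, 1))         # regression targets
H = 32                                # hidden width

# Hypothetical hidden-layer targets: random projections of the labels.
Y_hidden = Y @ rng.normal(size=(1, H))

mu1, var1 = init_layer_variational_params(X, Y_hidden)   # layer 1 q(W1)
A1 = np.tanh(X @ mu1)                                     # propagate posterior means
mu2, var2 = init_layer_variational_params(A1, Y)          # layer 2 q(W2)
```

The resulting means and variances would then serve as the starting point for the variational parameters before running stochastic variational inference on the full network.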
