On Modern Deep Learning and Variational Inference

Bayesian modelling and variational inference are rooted in Bayesian statistics, and easily benefit from the vast literature in the field. In contrast, deep learning lacks a solid mathematical grounding. Instead, empirical developments in deep learning are often justified by metaphors, evading the unexplained principles at play. It is perhaps astonishing, then, that most modern deep learning models can be cast as performing approximate variational inference in a Bayesian setting. This mathematically grounded result, studied in Gal and Ghahramani [1] for deep neural networks (NNs), is extended here to arbitrary deep learning models. The implications of this statement are profound: we can use the rich Bayesian statistics literature with deep learning models, explain away many of the curiosities observed with these models, bring results from deep learning into Bayesian modelling, and much more. We demonstrate the practical impact of the framework on image classification by combining Bayesian and deep learning techniques, obtaining new state-of-the-art results, and survey open problems for future research. These stand at the forefront of a new and exciting field combining modern deep learning and Bayesian techniques.
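
As a concrete illustration of the kind of technique this framework grounds, below is a minimal sketch (not taken from the paper) of Monte Carlo dropout for classification: dropout is kept active at test time and several stochastic forward passes are averaged, which the dropout-as-variational-inference view [6, 11] interprets as approximating the Bayesian predictive distribution. The network shape, dropout rate, and number of samples T are illustrative assumptions, not choices prescribed by the paper.

```python
# Minimal MC-dropout sketch (illustrative only). Each forward pass with
# dropout noise switched on corresponds to one sample from the approximate
# posterior over weights; averaging passes approximates the predictive mean.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropoutMLP(nn.Module):
    def __init__(self, d_in=784, d_hidden=256, d_out=10, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)
        self.p = p

    def forward(self, x):
        # training=True keeps the Bernoulli dropout noise on even at test time.
        h = F.dropout(F.relu(self.fc1(x)), p=self.p, training=True)
        return self.fc2(h)

@torch.no_grad()
def mc_dropout_predict(model, x, T=50):
    """Average T stochastic forward passes; the spread across passes gives a
    simple estimate of the model's predictive uncertainty."""
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(T)])
    return probs.mean(0), probs.std(0)

# Usage on random inputs, purely for illustration:
model = DropoutMLP()
x = torch.randn(4, 784)
mean, std = mc_dropout_predict(model, x)
```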

[1] Charles Blundell, et al. Weight Uncertainty in Neural Networks, 2015, ArXiv.

[2] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[3] Yarin Gal and Zoubin Ghahramani. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, 2015, ArXiv.

[4] Alex Graves. Practical Variational Inference for Neural Networks, 2011, NIPS.

[5] David Barber and Christopher M. Bishop. Ensemble learning in Bayesian neural networks, 1998.

[6] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, 2015, ICML.

[7] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images, 2009.

[8] Pierre Baldi, et al. Understanding Dropout, 2013, NIPS.

[9] Yangqing Jia, et al. Caffe: Convolutional Architecture for Fast Feature Embedding, 2014, ACM Multimedia.

[10] Stefan Wager, et al. Dropout Training as Adaptive Regularization, 2013, NIPS.

[11] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Insights and Applications, 2015.

[12] Li Wan, et al. Regularization of Neural Networks using DropConnect, 2013, ICML.

[13] Yarin Gal and Richard E. Turner. Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs, 2015, ICML.

[14] Yann LeCun, et al. The MNIST database of handwritten digits, 2005.

[15] Geoffrey E. Hinton, et al. Improving neural networks by preventing co-adaptation of feature detectors, 2012, ArXiv.

[16] Yann LeCun, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[17] Geoffrey E. Hinton, et al. Keeping the neural networks simple by minimizing the description length of the weights, 1993, COLT.

[18] Alex Krizhevsky, et al. ImageNet classification with deep convolutional neural networks, 2012, NIPS.

[19] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning), 2005.