Gradient conjugate priors and multi-layer neural networks

This paper deals with learning the probability distribution of observed data by artificial neural networks. We propose a so-called gradient conjugate prior (GCP) update suitable for neural networks, a modification of the classical Bayesian update for conjugate priors, and establish a connection between the GCP update and the maximization of the log-likelihood of the predictive distribution. Unlike Bayesian neural networks, we keep the network weights deterministic; instead, we assume that the ground truth distribution is normal with unknown mean and variance and let the neural network learn the parameters of a normal-gamma prior for this unknown mean and variance. The parameters are updated with a gradient step that, at each iteration, moves toward minimizing the Kullback-Leibler divergence from the prior to the posterior distribution (both normal-gamma). We derive the corresponding dynamical system for the prior's parameters and analyze its properties. In particular, we study the limiting behavior of all the prior's parameters and show how it differs from that of the classical full Bayesian update. The results are validated on synthetic and real-world data sets.
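
To make the setting concrete, the sketch below (not the authors' code) implements one GCP-style update for a single scalar observation: the classical conjugate normal-gamma posterior is computed in closed form, and the prior's parameters (m, lambda, alpha, beta) are then nudged by a small gradient step that decreases the KL divergence between the two normal-gamma distributions. The KL direction, the choice to hold the posterior fixed during differentiation, the finite-difference gradient, and the learning rate are illustrative assumptions rather than the paper's exact scheme.

# Minimal GCP-style sketch for one observation, assuming:
#   * data are N(mu, 1/tau) with unknown mean mu and precision tau;
#   * (mu, tau) carries a normal-gamma prior NG(m, lam, alpha, beta);
#   * the conjugate posterior after observing y is available in closed form;
#   * the prior parameters are moved by a small finite-difference gradient step
#     that decreases KL(posterior || prior); the KL direction, step size `lr`,
#     and fixed-posterior treatment are assumptions made for illustration.
import numpy as np
from scipy.special import gammaln, digamma

def conjugate_posterior(m, lam, alpha, beta, y):
    """Classical Bayesian update of a normal-gamma prior after observing y."""
    m_post = (lam * m + y) / (lam + 1.0)
    lam_post = lam + 1.0
    alpha_post = alpha + 0.5
    beta_post = beta + lam * (y - m) ** 2 / (2.0 * (lam + 1.0))
    return m_post, lam_post, alpha_post, beta_post

def kl_normal_gamma(p, q):
    """KL( NG(p) || NG(q) ), each parametrized as (m, lam, alpha, beta) with rate beta."""
    m1, l1, a1, b1 = p
    m2, l2, a2, b2 = q
    # Normal factor, with the precision tau integrated out under Gamma(a1, b1).
    kl_norm = (0.5 * np.log(l1 / l2) + 0.5 * l2 * (a1 / b1) * (m1 - m2) ** 2
               + 0.5 * l2 / l1 - 0.5)
    # Gamma factor (shape/rate parametrization).
    kl_gam = ((a1 - a2) * digamma(a1) - gammaln(a1) + gammaln(a2)
              + a2 * (np.log(b1) - np.log(b2)) + a1 * (b2 - b1) / b1)
    return kl_norm + kl_gam

def gcp_step(prior, y, lr=1e-2, eps=1e-5):
    """One GCP-style step: finite-difference gradient of KL(posterior || prior)."""
    posterior = conjugate_posterior(*prior, y)     # posterior held fixed below
    prior = np.asarray(prior, dtype=float)
    grad = np.zeros_like(prior)
    for i in range(4):
        bumped = prior.copy()
        bumped[i] += eps
        grad[i] = (kl_normal_gamma(posterior, tuple(bumped))
                   - kl_normal_gamma(posterior, tuple(prior))) / eps
    return tuple(prior - lr * grad)

prior = (0.0, 1.0, 2.0, 1.0)   # initial (m, lam, alpha, beta)
for y in np.random.normal(loc=1.0, scale=0.5, size=200):
    prior = gcp_step(prior, y)
print("learned prior parameters:", prior)

In the paper's setting the four parameters are outputs of a neural network evaluated at the input x, and the gradient of the divergence is backpropagated through the network weights; the free-parameter version above is only meant to show the shape of a single GCP update.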
