Ensemble learning in Bayesian neural networks

Bayesian treatments of learning in neural networks are typically based either on a local Gaussian approximation to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. The original derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and hence was unable to capture the posterior correlations between parameters. In this chapter we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. One of the benefits of our approach is that it yields a strict lower bound on the marginal likelihood, in contrast to other approximate procedures.
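
To make the objective concrete, the bound that ensemble learning optimizes can be sketched in standard variational notation (this sketch uses generic notation and is not quoted from the chapter). Writing w for the network weights and D for the data, any approximating distribution Q(w) satisfies

\[
\ln P(D) \;=\; \ln \int P(D \mid w)\, P(w)\, dw \;\geq\; \int Q(w) \ln \frac{P(D \mid w)\, P(w)}{Q(w)}\, dw \;\equiv\; \mathcal{F}[Q],
\]

where the gap is exactly the Kullback-Leibler divergence to the true posterior,

\[
\ln P(D) - \mathcal{F}[Q] \;=\; \mathrm{KL}\big( Q(w) \,\|\, P(w \mid D) \big) \;\geq\; 0.
\]

Maximizing \(\mathcal{F}[Q]\) is therefore equivalent to minimizing this divergence, and \(\mathcal{F}[Q]\) is a strict lower bound on the log marginal likelihood for any choice of Q. Taking \(Q(w) = \mathcal{N}(w \mid \mu, \Sigma)\) with a full (rather than diagonal) covariance matrix \(\Sigma\) is what allows correlations between the weights under the posterior to be captured, while keeping the required expectations tractable as claimed in the abstract.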

[1] A. M. Walker, On the Asymptotic Behaviour of Posterior Distributions, 1969.

[2] D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[3] James O. Berger, Statistical Decision Theory and Bayesian Analysis, Second Edition, 1985.

[4] A. Kennedy et al., Hybrid Monte Carlo, 1987.

[5] J. Berger, Statistical Decision Theory and Bayesian Analysis, 1988.

[7] David J. C. MacKay, A Practical Bayesian Framework for Backpropagation Networks, Neural Computation, 1992.

[8] Geoffrey E. Hinton and Drew van Camp, Keeping the neural networks simple by minimizing the description length of the weights, COLT '93, 1993.

[9] Radford M. Neal, A new view of the EM algorithm that justifies incremental and other variants, 1993.

[10] Heekuck Oh et al., Neural Networks for Pattern Recognition, Adv. Comput., 1993.

[11] Barak A. Pearlmutter, Fast Exact Multiplication by the Hessian, Neural Computation, 1994.

[12] David J. C. MacKay, Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks, 1995.

[13] Radford M. Neal, Bayesian Learning for Neural Networks, 1995.

[14] David Barber and Christopher M. Bishop, Ensemble Learning for Multi-Layer Networks, NIPS, 1997.

[15] Neil D. Lawrence et al., Approximating Posterior Distributions in Belief Networks Using Mixtures, NIPS, 1997.

[16] David Barber et al., Radial Basis Functions: A Bayesian Treatment, NIPS, 1997.

[17] Christopher M. Bishop, Variational Learning in Graphical Models and Neural Networks, 1998.

[18] Michael I. Jordan (ed.), Learning in Graphical Models, NATO ASI Series, 1999.

[19] David Barber et al., Tractable Undirected Approximations for Graphical Models, 1998.

[20] Neil D. Lawrence et al., Mixture Representations for Inference and Learning in Boltzmann Machines, UAI, 1998.