Ensemble Learning for Multi-Layer Networks

Bayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. However, the derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.
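
To make the objective concrete, the following equations sketch the standard ensemble-learning criterion; the symbols Q(w), w-bar and C are introduced here for illustration and are not drawn from the paper itself. The approximating distribution Q(w) is fitted by minimizing its Kullback-Leibler divergence from the true weight posterior P(w | D), which is equivalent to maximizing a lower bound F[Q] on the log evidence:

\[
\mathrm{KL}\big[Q \,\|\, P\big] \;=\; \int Q(\mathbf{w}) \,\ln \frac{Q(\mathbf{w})}{P(\mathbf{w}\mid D)} \, d\mathbf{w} \;\ge\; 0,
\]
\[
\ln P(D) \;=\; \mathcal{F}[Q] \;+\; \mathrm{KL}\big[Q \,\|\, P\big],
\qquad
\mathcal{F}[Q] \;=\; \int Q(\mathbf{w}) \,\ln \frac{P(D\mid \mathbf{w})\, P(\mathbf{w})}{Q(\mathbf{w})} \, d\mathbf{w}.
\]

In the full-covariance extension, Q(w) would be a Gaussian N(w | \bar{w}, C) with a general (non-diagonal) covariance matrix C, so that maximizing F[Q] over \bar{w} and C can capture the posterior correlations that a diagonal approximation ignores.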

[1] J. Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes, 1906.

[2] H. Jeffreys. An invariant form for the prior probability in estimation problems, 1946, Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences.

[3] N. Metropolis, et al. Equation of State Calculations by Fast Computing Machines, 1953, Resonance.

[4] H. Rauch. Solutions to the linear smoothing problem, 1963.

[5] C. Striebel, et al. On the maximum likelihood estimates for linear dynamic systems, 1965.

[6] Andrew J. Viterbi, et al. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, 1967, IEEE Trans. Inf. Theory.

[7] O. Seeberg. Statistical Mechanics: A Set of Lectures, 1975.

[8] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm, plus discussions on the paper, 1977.

[9] G. Torrie, et al. Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling, 1977.

[10] G. Schwarz. Estimating the Dimension of a Model, 1978.

[11] Temple F. Smith. Occam's razor, 1980, Nature.

[12] R. Shumway, et al. An approach to time series smoothing and forecasting using the EM algorithm, 1982.

[13] C. D. Gelatt, et al. Optimization by Simulated Annealing, 1983, Science.

[14] C. S. Wallace, et al. Estimation and Inference by Compact Coding, 1987.

[15] David J. Spiegelhalter, et al. Local computations with probabilities on graphical structures and their application to expert systems, 1990.

[16] Judea Pearl, et al. Probabilistic reasoning in intelligent systems: networks of plausible inference, 1991, Morgan Kaufmann Series in Representation and Reasoning.

[17] R. T. Cox. Probability, frequency and reasonable expectation, 1990.

[18] A. O'Hagan, et al. Bayes–Hermite quadrature, 1991.

[19] Geoffrey E. Hinton, et al. Mean field networks that learn to discriminate temporally distorted strings, 1991.

[20] Biing-Hwang Juang, et al. Hidden Markov Models for Speech Recognition, 1991.

[21] James O. Berger, et al. Ockham's Razor and Bayesian Analysis, 1992.

[22] Andreas Stolcke, et al. Hidden Markov Model Induction by Bayesian Model Merging, 1992, NIPS.

[23] David J. C. MacKay, et al. Bayesian Interpolation, 1992, Neural Computation.

[24] Radford M. Neal. Connectionist Learning of Belief Networks, 1992, Artif. Intell.

[25] David J. C. MacKay, et al. A Practical Bayesian Framework for Backpropagation Networks, 1992, Neural Computation.

[26] C. Robert, et al. Bayesian estimation of hidden Markov chains: a stochastic implementation, 1993.

[27] Geoffrey E. Hinton, et al. Keeping Neural Networks Simple, 1993.

[28] Geoffrey E. Hinton, et al. Keeping the neural networks simple by minimizing the description length of the weights, 1993, COLT '93.

[29] Heekuck Oh, et al. Neural Networks for Pattern Recognition, 1993, Adv. Comput.

[30] Jonathan J. Hull, et al. A Database for Handwritten Text Recognition Research, 1994, IEEE Trans. Pattern Anal. Mach. Intell.

[31] David J. C. MacKay, et al. A hierarchical Dirichlet language model, 1995, Natural Language Engineering.

[32] David J. C. MacKay, et al. Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks, 1995.

[33] Steve R. Waterhouse, et al. Bayesian Methods for Mixtures of Experts, 1995, NIPS.

[34] Geoffrey E. Hinton, et al. Bayesian Learning for Neural Networks, 1995.

[35] David J. C. MacKay, et al. Developments in Probabilistic Modelling with Neural Networks: Ensemble Learning, 1995, SNN Symposium on Neural Networks.

[36] Michael I. Jordan, et al. Mean Field Theory for Sigmoid Belief Networks, 1996, J. Artif. Intell. Res.

[37] Peter C. Cheeseman, et al. Bayesian Classification (AutoClass): Theory and Results, 1996, Advances in Knowledge Discovery and Data Mining.

[38] Michael I. Jordan, et al. Hidden Markov Decision Trees, 1996, NIPS.

[39] David Bruce Wilson, et al. Exact sampling with coupled Markov chains and applications to statistical mechanics, 1996, Random Struct. Algorithms.

[40] James Allen Fill, et al. An interruptible algorithm for perfect sampling via Markov chains, 1997, STOC '97.

[41] Michael I. Jordan, et al. Variational methods for inference and estimation in graphical models, 1997.

[42] Neil D. Lawrence, et al. Approximating Posterior Distributions in Belief Networks Using Mixtures, 1997, NIPS.

[43] P. Green, et al. Corrigendum: On Bayesian analysis of mixtures with an unknown number of components, 1997.

[44] P. Saama. Maximum likelihood and Bayesian methods for mixtures of normal distributions, 1997.

[45] David Barber, et al. On Computing the KL Divergence for Bayesian Neural Networks, 1997.

[46] Michael I. Jordan, et al. Probabilistic Independence Networks for Hidden Markov Probability Models, 1997, Neural Computation.

[47] T. Jaakkola, et al. Improving the Mean Field Approximation via the Use of Mixture Distributions, 1999, Learning in Graphical Models.

[48] Christopher K. I. Williams, et al. DTs: Dynamic Trees, 1998, NIPS.

[49] Jim Q. Smith, et al. On the Geometry of Bayesian Graphical Models with Hidden Variables, 1998, UAI.

[50] Nir Friedman, et al. The Bayesian Structural EM Algorithm, 1998, UAI.

[51] Yoshua Bengio, et al. Convolutional networks for images, speech, and time series, 1998.

[52] Ross D. Shachter. Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams), 1998, UAI.

[53] Radford M. Neal. Assessing relevance determination methods using DELVE, 1998.

[54] Xavier Boyen, et al. Tractable Inference for Complex Stochastic Processes, 1998, UAI.

[55] Geoffrey E. Hinton, et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, 1998, Learning in Graphical Models.

[56] Neil D. Lawrence, et al. Mixture Representations for Inference and Learning in Boltzmann Machines, 1998, UAI.

[57] P. Green, et al. Exact Sampling from a Continuous State Space, 1998.

[58] William D. Penny, et al. Bayesian Approaches to Gaussian Mixture Modeling, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[59] Christopher M. Bishop, et al. Mixtures of Probabilistic Principal Component Analyzers, 1999, Neural Computation.

[60] G. Casella, et al. Perfect Slice Samplers for Mixtures of Distributions, 1999.

[61] Zoubin Ghahramani, et al. A Unifying Review of Linear Gaussian Models, 1999, Neural Computation.

[62] Harri Lappalainen, et al. Ensemble learning for independent component analysis, 1999.

[63] Carl E. Rasmussen, et al. The Infinite Gaussian Mixture Model, 1999, NIPS.

[64] David J. Spiegelhalter, et al. Probabilistic Networks and Expert Systems, 1999, Information Science and Statistics.

[65] David J. C. MacKay, et al. Comparison of Approximate Methods for Handling Hyperparameters, 1999, Neural Computation.

[66] Neil D. Lawrence, et al. A Variational Bayesian Committee of Neural Networks, 1999.