Variational algorithms for approximate Bayesian inference

The Bayesian framework for machine learning allows for the incorporation of prior knowledge in a coherent way, avoids overfitting problems, and provides a principled basis for selecting between alternative models. Unfortunately, the computations required are usually intractable. This thesis presents a unified variational Bayesian (VB) framework which approximates these computations in models with latent variables using a lower bound on the marginal likelihood. Chapter 1 presents background material on Bayesian inference, graphical models, and propagation algorithms. Chapter 2 forms the theoretical core of the thesis, generalising the expectation-maximisation (EM) algorithm for learning maximum likelihood parameters to the VB EM algorithm, which integrates over model parameters. The algorithm is then specialised to the large family of conjugate-exponential (CE) graphical models, and several theorems are presented to pave the way for automated VB derivation procedures in both directed and undirected graphs (Bayesian and Markov networks, respectively). Chapters 3–5 derive and apply the VB EM algorithm to three commonly used and important models: mixtures of factor analysers, linear dynamical systems, and hidden Markov models. It is shown how model selection tasks, such as determining the dimensionality, cardinality, or number of variables, are possible using VB approximations. Also explored are methods for combining sampling procedures with variational approximations, to estimate the tightness of VB bounds and to obtain more effective sampling algorithms. Chapter 6 applies VB learning to the long-standing problem of scoring discrete-variable directed acyclic graphs, and compares its performance to that of annealed importance sampling, amongst other methods. Throughout, the VB approximation is compared to other methods including sampling, Cheeseman-Stutz, and asymptotic approximations such as BIC. The thesis concludes with a discussion of evolving directions for model selection, including infinite models and alternative approximations to the marginal likelihood.
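For concreteness, the bound at the heart of the framework can be stated in generic notation (assumed here for illustration: $\mathbf{y}$ denotes observed data, $\mathbf{x}$ hidden variables, and $\boldsymbol\theta$ parameters; the factorisation $q(\mathbf{x}, \boldsymbol\theta) \approx q_{\mathbf{x}}(\mathbf{x})\, q_{\boldsymbol\theta}(\boldsymbol\theta)$ is the defining VB approximation). Jensen's inequality gives a lower bound on the log marginal likelihood,

\[
\ln p(\mathbf{y}) \;\ge\; \mathcal{F}(q_{\mathbf{x}}, q_{\boldsymbol\theta})
= \int q_{\mathbf{x}}(\mathbf{x})\, q_{\boldsymbol\theta}(\boldsymbol\theta)\,
  \ln \frac{p(\mathbf{x}, \mathbf{y} \mid \boldsymbol\theta)\, p(\boldsymbol\theta)}
           {q_{\mathbf{x}}(\mathbf{x})\, q_{\boldsymbol\theta}(\boldsymbol\theta)}
  \, d\mathbf{x}\, d\boldsymbol\theta,
\]

which the VB EM algorithm maximises by coordinate ascent, alternating

\[
\text{VBE: } q_{\mathbf{x}}(\mathbf{x}) \propto
  \exp\!\left( \int q_{\boldsymbol\theta}(\boldsymbol\theta)
  \ln p(\mathbf{x}, \mathbf{y} \mid \boldsymbol\theta)\, d\boldsymbol\theta \right),
\qquad
\text{VBM: } q_{\boldsymbol\theta}(\boldsymbol\theta) \propto
  p(\boldsymbol\theta)\, \exp\!\left( \int q_{\mathbf{x}}(\mathbf{x})
  \ln p(\mathbf{x}, \mathbf{y} \mid \boldsymbol\theta)\, d\mathbf{x} \right).
\]

The following is a minimal sketch, not taken from the thesis, of this coordinate-ascent alternation on a toy conjugate-exponential model: a Gaussian with unknown mean and precision under a Normal-Gamma prior. The model, variable names, and hyperparameter settings are all illustrative assumptions.

    # Coordinate-ascent VB for x_n ~ N(mu, 1/tau), with conjugate prior
    # mu | tau ~ N(mu0, 1/(lam0*tau)) and tau ~ Gamma(a0, b0), under the
    # factorised approximation q(mu, tau) = q(mu) q(tau).
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=0.5, size=200)      # synthetic data
    N, xbar, x2sum = len(x), x.mean(), np.sum(x**2)

    mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0            # prior hyperparameters
    E_tau = a0 / b0                                   # initialise E[tau]

    for _ in range(50):
        # Update q(mu) = N(mu_N, 1/lam_N) given the current E[tau]
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N

        # Update q(tau) = Gamma(a_N, b_N) given the moments of q(mu)
        a_N = a0 + 0.5 * (N + 1)
        b_N = b0 + 0.5 * (x2sum - 2.0 * E_mu * N * xbar + N * E_mu2
                          + lam0 * (E_mu2 - 2.0 * mu0 * E_mu + mu0**2))
        E_tau = a_N / b_N

    print(f"E[mu] = {mu_N:.3f}, E[tau] = {E_tau:.3f}")

Each update monotonically increases the lower bound, so the loop converges to a local optimum of $\mathcal{F}$; in the conjugate-exponential family treated in Chapter 2 both updates remain in closed form, which is what makes automated derivation procedures possible.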
