Universal Boosting Variational Inference

Boosting variational inference (BVI) approximates an intractable probability density by iteratively building up a mixture of simple component distributions, one at a time, using techniques from sparse convex optimization to provide both computational scalability and approximation error guarantees. However, these guarantees rely on strong conditions that often fail to hold in practice, leading to degenerate component optimization problems; we show that the ad hoc regularization used to prevent this degeneracy can cause BVI to fail in unintuitive ways. We therefore develop universal boosting variational inference (UBVI), a BVI scheme that exploits the simple geometry of probability densities under the Hellinger metric to prevent the degeneracy affecting other gradient-based BVI methods, avoid difficult joint optimizations over both component and weight, and simplify fully corrective weight optimizations. We show that, for any target density and any mixture component family, the output of UBVI converges to the best possible approximation in the mixture family, even when that family is misspecified. We develop a scalable implementation based on exponential family mixture components and standard stochastic optimization techniques. Finally, we discuss the statistical benefits of the Hellinger distance as a variational objective through bounds on posterior probability, moment, and importance sampling errors. Experiments on multiple datasets and models show that UBVI provides reliable, accurate posterior approximations.
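To make the geometry referenced above concrete, here is a brief sketch in our own notation (not notation taken from the paper): the Hellinger distance between two densities p and q is, up to a constant, the L^2 distance between their square roots, so mapping each density to its square root places all densities on the unit sphere of L^2:

\[
H(p, q) = \left( \frac{1}{2} \int \big( \sqrt{p(x)} - \sqrt{q(x)} \big)^2 \, dx \right)^{1/2},
\qquad
H(p, q)^2 = 1 - \int \sqrt{p(x)\, q(x)} \, dx = 1 - \big\langle \sqrt{p}, \sqrt{q} \big\rangle_{L^2}.
\]

Since every square-root density has unit L^2 norm, minimizing the Hellinger distance to the target p over a family of approximations is equivalent to maximizing the inner product \( \langle \sqrt{p}, \sqrt{q} \rangle \) over points on the unit sphere. This is the well-behaved sparse-approximation geometry that, per the abstract, UBVI's greedy component selection and fully corrective weight updates exploit; the exact update rules are given in the paper itself.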
