Variational approximations using Fisher divergence

Modern applications of Bayesian inference involve models that are sufficiently complex that the corresponding posterior distributions are intractable and must be approximated. The most common approximations are based on Markov chain Monte Carlo, but these can be expensive when the data set is large and/or the model is complex, so more efficient variational approximations have recently received considerable attention. Traditional variational methods, which seek to minimize the Kullback--Leibler divergence between the posterior and a relatively simple parametric family, provide accurate and efficient estimation of the posterior mean, but they often fail to capture other moments and are limited in the models to which they can be applied. Here we propose constructing variational approximations by minimizing the Fisher divergence, and we develop an efficient computational algorithm that can be applied to a wide range of models without conjugacy or potentially unrealistic mean-field assumptions. We demonstrate the superior performance of the proposed method on the benchmark case of logistic regression.
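
To make the idea concrete, the sketch below (ours, not the paper's implementation) fits a mean-field Gaussian approximation q to a Bayesian logistic-regression posterior by Monte Carlo minimization of the Fisher divergence F(q || p) = E_q || grad log q(theta) - grad log p(theta | X, y) ||^2; because the posterior score equals the score of the unnormalized joint, the objective needs no normalizing constant. All function names, the Gaussian prior variance, and the optimizer settings are illustrative assumptions.

```python
# Minimal sketch of Fisher-divergence variational inference for Bayesian
# logistic regression with a diagonal-Gaussian q. Hyper-parameters and names
# are assumptions, not the authors' algorithm.
import jax
import jax.numpy as jnp

def log_joint_grad(theta, X, y, prior_var=10.0):
    # Score of the unnormalized log posterior: logistic likelihood + N(0, prior_var I) prior.
    p = jax.nn.sigmoid(X @ theta)
    return X.T @ (y - p) - theta / prior_var

def fisher_divergence(params, key, X, y, n_samples=32):
    # Monte Carlo estimate of E_q || grad log q(theta) - grad log p(theta | X, y) ||^2.
    mu, log_sigma = params
    sigma = jnp.exp(log_sigma)
    eps = jax.random.normal(key, (n_samples, mu.shape[0]))
    theta = mu + sigma * eps                    # reparameterized draws from q
    score_q = -(theta - mu) / sigma**2          # grad_theta log q(theta) for a diagonal Gaussian
    score_p = jax.vmap(lambda t: log_joint_grad(t, X, y))(theta)
    return jnp.mean(jnp.sum((score_q - score_p) ** 2, axis=1))

def fit(X, y, n_iters=2000, lr=1e-3, seed=0):
    # Plain gradient descent on the variational parameters (mu, log_sigma).
    d = X.shape[1]
    params = (jnp.zeros(d), jnp.zeros(d))
    key = jax.random.PRNGKey(seed)
    grad_fn = jax.jit(jax.grad(fisher_divergence))
    for _ in range(n_iters):
        key, sub = jax.random.split(key)
        grads = grad_fn(params, sub, X, y)
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params
```

The reparameterization theta = mu + sigma * eps lets gradients flow into the variational parameters through the sampled scores; the paper's actual optimization scheme may differ.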
