Variational Inference with Tail-adaptive f-Divergence

Variational inference with α-divergences has been widely used in modern probabilistic machine learning. Compared to the Kullback-Leibler (KL) divergence, a major advantage of α-divergences (with positive α values) is their mass-covering property. However, estimating and optimizing α-divergences requires importance sampling, whose estimates can have extremely large or even infinite variance because the importance weights are heavy-tailed. In this paper, we propose a new class of tail-adaptive f-divergences that adaptively change the convex function f with the tail of the importance weights, in a way that theoretically guarantees finite moments while simultaneously achieving mass-covering properties. We test our method on Bayesian neural networks and on deep reinforcement learning, where it is used to improve the recent soft actor-critic (SAC) algorithm (Haarnoja et al., 2018). Our results show that our approach yields significant advantages over existing methods based on the classical KL and α-divergences.
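
The abstract does not specify how the adaptive f is constructed, so the following is only a minimal illustrative sketch of one ingredient such a scheme could rely on: estimating the tail index of the importance weights with the Hill estimator [1] and capping the power applied to the weights so that the corresponding moment stays finite. The function names, the safety factor, and the cap below are hypothetical choices for illustration, not the paper's algorithm.

```python
import numpy as np

def hill_tail_index(weights, k=100):
    """Hill (1975) estimator of the tail index of positive importance weights.

    Larger values indicate lighter tails; moments of order strictly below the
    tail index are finite for a Pareto-type tail.
    """
    w = np.sort(np.asarray(weights, dtype=float))[::-1]  # descending order statistics
    k = min(k, len(w) - 1)
    return 1.0 / np.mean(np.log(w[:k] / w[k]))           # 1 / mean excess log-ratio

def adaptive_power(weights, safety=0.5, max_power=2.0):
    """Hypothetical rule: pick an exponent beta for w**beta so that the
    beta-th moment of the weights is (estimated to be) finite."""
    return float(np.clip(safety * hill_tail_index(weights), 0.0, max_power))

# Toy usage: heavy-tailed importance weights w_i = p(x_i) / q(x_i), x_i ~ q
rng = np.random.default_rng(0)
w = rng.pareto(1.5, size=10_000) + 1.0   # Pareto weights with tail index ~1.5
print(f"tail index ~ {hill_tail_index(w):.2f}, adapted power ~ {adaptive_power(w):.2f}")
```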

[1] B. M. Hill. A Simple General Approach to Inference About the Tail of a Distribution, 1975, The Annals of Statistics.

[2] Tom Minka. Expectation Propagation for approximate Bayesian inference, 2001, UAI.

[3] F. Österreicher. f-Divergences: Representation Theorem and Metrizability, 2003.

[4] Imre Csiszár, et al. Information Theory and Statistics: A Tutorial, 2004, Found. Trends Commun. Inf. Theory.

[5] Michael I. Jordan, et al. An Introduction to Variational Methods for Graphical Models, 1999, Machine Learning.

[6] Thomas P. Minka. Divergence measures and message passing, 2005.

[7] Ole Winther, et al. Expectation Consistent Approximate Inference, 2005, J. Mach. Learn. Res.

[8] Shie Mannor, et al. A Tutorial on the Cross-Entropy Method, 2005, Ann. Oper. Res.

[9] Igor Vajda, et al. On Divergences and Informations in Statistics and Information Theory, 2006, IEEE Transactions on Information Theory.

[10] S. Resnick. Heavy-Tail Phenomena: Probabilistic and Statistical Modeling, 2006, Springer.

[11] Christopher M. Bishop. Pattern Recognition and Machine Learning, 2006, Springer.

[12] Jean-Michel Marin, et al. Adaptive importance sampling in general mixture classes, 2007, Stat. Comput.

[13] Michael I. Jordan, et al. Graphical Models, Exponential Families, and Variational Inference, 2008, Found. Trends Mach. Learn.

[14] Mark D. Reid, et al. Information, Divergence and Risk for Binary Experiments, 2009, J. Mach. Learn. Res.

[15] Chong Wang, et al. Stochastic variational inference, 2012, J. Mach. Learn. Res.

[16] Adaptive Importance Sampling via Stochastic Convex Programming, 2014, arXiv:1412.4845.

[17] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[18] Sean Gerrish, et al. Black Box Variational Inference, 2013, AISTATS.

[19] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[20] A. Gelman, et al. Pareto Smoothed Importance Sampling, 2015, arXiv:1507.02646.

[21] Parallel Adaptive Importance Sampling, 2015.

[22] Ruslan Salakhutdinov, et al. Importance Weighted Autoencoders, 2015, ICLR.

[23] David M. Blei, et al. Variational Inference: A Review for Statisticians, 2016, arXiv.

[24] Richard E. Turner, et al. Rényi Divergence Variational Inference, 2016, NIPS.

[25] Richard E. Turner, et al. Black-box α-divergence minimization, 2016, ICML.

[26] Zoubin Ghahramani, et al. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, 2015, ICML.

[27] Adji B. Dieng, Dustin Tran, et al. Variational Inference via χ Upper Bound Minimization, 2017, NIPS.

[28] Ben Poole, et al. Categorical Reparameterization with Gumbel-Softmax, 2016, ICLR.

[29] Yee Whye Teh, et al. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, 2016, ICLR.

[30] Jan Peters, et al. Mean squared advantage minimization as a consequence of entropic policy improvement regularization, 2018.

[31] Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review, 2018, arXiv.

[32] Tuomas Haarnoja, et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, 2018, ICML.

[33] Igal Sason. On f-Divergences: Integral Representations, Local Behavior, and Inequalities, 2018, Entropy.

[34] Jan Peters, et al. f-Divergence constrained policy improvement, 2017, arXiv.