Optimizing variational representations of divergences and accelerating their statistical estimation

Variational representations of distances and divergences between high-dimensional probability distributions offer significant theoretical insights and practical advantages in numerous research areas. Recently, they have gained popularity in machine learning as a tractable and scalable approach for training probabilistic models and for statistically differentiating between data distributions. Their advantages include: 1) they can be estimated from data, and 2) they can leverage the ability of neural networks to efficiently approximate optimal solutions in function spaces. However, a systematic and practical approach to improving the tightness of such variational formulas, and thereby accelerating statistical learning and estimation from data, is currently lacking. Here we develop a systematic methodology for building new, tighter variational representations of divergences. Our approach relies on improved objective functionals constructed via an auxiliary optimization problem. Furthermore, computing the functional Hessian of the objective functionals reveals the differences in local curvature around the common optimal variational solution; this allows us to quantify and order the relative tightness gains between different variational representations. Finally, numerical simulations employing neural-network optimization demonstrate that tighter representations can yield significantly faster learning and more accurate estimation of divergences on both synthetic and real datasets (of more than 700 dimensions), often accelerating convergence by nearly an order of magnitude.
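As a concrete illustration of the kind of neural variational estimator the abstract refers to (not the improved objectives developed in this work), the sketch below estimates the KL divergence from samples using the classical Donsker-Varadhan representation, D_KL(P || Q) = sup_phi { E_P[phi] - log E_Q[exp(phi)] }, with the test function phi parameterized by a small neural network. The network architecture, optimizer settings, and helper names (Critic, dv_objective, estimate_kl) are illustrative assumptions, not part of the paper.

```python
# Minimal sketch, assuming PyTorch: sample-based KL estimation via the
# Donsker-Varadhan variational bound with a neural-network test function.
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Neural parameterization of the variational test function phi."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def dv_objective(phi, x_p, x_q):
    """Donsker-Varadhan objective: a lower bound on D_KL(P || Q)."""
    # E_P[phi] - log E_Q[exp(phi)], with the Q-expectation computed via logsumexp.
    return phi(x_p).mean() - (torch.logsumexp(phi(x_q), dim=0) - math.log(x_q.shape[0]))

def estimate_kl(sample_p, sample_q, steps=2000, lr=1e-3):
    """Maximize the DV bound over phi; the optimized value estimates D_KL(P || Q)."""
    phi = Critic(sample_p.shape[1])
    opt = torch.optim.Adam(phi.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -dv_objective(phi, sample_p, sample_q)  # ascend the lower bound
        loss.backward()
        opt.step()
    with torch.no_grad():
        return dv_objective(phi, sample_p, sample_q).item()

if __name__ == "__main__":
    torch.manual_seed(0)
    # Two 2-D Gaussians with a mean shift of 1/sqrt(2) per coordinate;
    # the true KL divergence is ||mu||^2 / 2 = 0.5.
    x_p = torch.randn(5000, 2) + 1.0 / math.sqrt(2.0)
    x_q = torch.randn(5000, 2)
    print("Estimated KL:", estimate_kl(x_p, x_q))
```

Tighter representations of the same divergence replace the objective above with one whose optimum is unchanged but whose landscape has larger curvature around the optimizer, which is what accelerates learning in practice.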
