KALE: When Energy-Based Learning Meets Adversarial Training

Legendre duality provides a variational lower bound on the Kullback-Leibler (KL) divergence that can be estimated from samples, without explicit knowledge of the density ratio. We use this estimator, the KL Approximate Lower-bound Estimate (KALE), in a contrastive setting for learning energy-based models, and show that it provides a maximum likelihood estimate (MLE). We then extend this procedure to adversarial training, where the discriminator represents the energy and the generator is the base measure of the energy-based model. Unlike in standard generative adversarial networks (GANs), the learned model uses both the generator and the discriminator to generate samples. This is achieved by running Hamiltonian Monte Carlo in the latent space of the generator, guided by the discriminator, to find regions of that space that produce higher-quality samples. We also show that, unlike the KL, KALE enjoys smoothness properties that make it suitable for adversarial training, and we provide convergence rates for KALE when the negative log density ratio belongs to the variational family. Finally, we demonstrate the effectiveness of this approach on simple datasets.
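A minimal sketch of the variational bound underlying KALE, stated in the standard Fenchel-dual form of the KL divergence (the exact normalization constant and any regularization used in the paper are assumptions here): for any function $h$ in the variational family $\mathcal{H}$,

\[
\mathrm{KL}(P \,\|\, Q) \;\ge\; \mathbb{E}_{P}[h] \;-\; \mathbb{E}_{Q}\!\left[e^{h}\right] \;+\; 1,
\]

with equality when $h = \log \tfrac{dP}{dQ}$. Restricting $h$ to $\mathcal{H}$ and replacing expectations by sample averages gives a plug-in estimate of the form

\[
\widehat{\mathrm{KALE}}(P \,\|\, Q) \;=\; \sup_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} h(x_i) \;-\; \frac{1}{m}\sum_{j=1}^{m} e^{h(y_j)} \;+\; 1, \qquad x_i \sim P, \; y_j \sim Q,
\]

which requires only samples from $P$ and $Q$, not the density ratio itself.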
