On Fenchel Mini-Max Learning

Inference, estimation, sampling, and likelihood evaluation are four primary goals of probabilistic modeling. Practical considerations often force modeling approaches to make compromises among these objectives. We present a novel probabilistic learning framework, called Fenchel Mini-Max Learning (FML), that accommodates all four desiderata in a flexible and scalable manner. Our derivation is rooted in classical maximum likelihood estimation (MLE), and it overcomes a longstanding challenge that prevents unbiased estimation of unnormalized statistical models. By reformulating MLE as a mini-max game, FML enjoys an unbiased training objective that (i) does not explicitly involve the intractable normalizing constant and (ii) is directly amenable to stochastic gradient descent optimization. To demonstrate the utility of the proposed approach, we apply FML to learning unnormalized statistical models, nonparametric density estimation, and training generative models, and report encouraging empirical results.
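
To illustrate the flavor of the reformulation described above (a hedged sketch only; the notation $p_\theta$, $u$, and $p_d$ is ours, and the paper's exact objective and its treatment of the normalizing constant may differ), recall that the negative logarithm, being convex, equals the pointwise supremum of its affine minorants, i.e., it admits the Fenchel-conjugate representation
\[
-\log t \;=\; \max_{u > 0}\,\bigl\{\, \log u + 1 - u\,t \,\bigr\}, \qquad t > 0,
\]
with the maximum attained at $u = 1/t$. Substituting $t = p_\theta(x)$ and averaging over the data distribution $p_d$ recasts maximum likelihood estimation as a saddle-point (mini-max) problem,
\[
\min_{\theta}\;\max_{u(\cdot) > 0}\;\; \mathbb{E}_{x \sim p_d}\bigl[\, \log u(x) + 1 - u(x)\, p_\theta(x) \,\bigr],
\]
in which the model likelihood enters only linearly. Consequently, any unbiased estimator of $p_\theta(x)$ yields an unbiased estimate of the objective (and, under mild regularity conditions, of its gradient), which is what makes stochastic gradient descent directly applicable. How the intractable normalizing constant is handled in the actual FML construction follows the paper's own derivation, which this sketch does not reproduce.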
