Telescoping Density-Ratio Estimation

Density-ratio estimation via classification is a cornerstone of unsupervised learning. It has provided the foundation for state-of-the-art methods in representation learning and generative modelling, and its range of use cases continues to grow. However, it suffers from a critical limitation: it fails to accurately estimate ratios p/q when the two densities differ significantly. Empirically, we find this occurs whenever the KL divergence between p and q exceeds tens of nats. To resolve this limitation, we introduce a new framework, telescoping density-ratio estimation (TRE), which enables the estimation of ratios between highly dissimilar densities in high-dimensional spaces. Our experiments demonstrate that TRE can yield substantial improvements over existing single-ratio methods for mutual information estimation, representation learning and energy-based modelling.
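
The name reflects the construction at the heart of the framework. Rather than training one classifier to separate p from q directly, TRE bridges the two with a chain of intermediate "waymark" densities, so that the hard ratio telescopes into a product of easier ones. The identity below is a sketch of that decomposition; the notation p_k is illustrative rather than fixed by the abstract:

\[
\frac{p(x)}{q(x)} \;=\; \frac{p_0(x)}{p_1(x)} \cdot \frac{p_1(x)}{p_2(x)} \cdots \frac{p_{m-1}(x)}{p_m(x)},
\qquad p_0 = p, \quad p_m = q .
\]

Each factor compares two nearby densities, so each can be estimated accurately by classification even when p and q themselves are tens of nats apart; taking logarithms turns the product into a sum of per-step log-ratio estimates.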
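
To make this concrete, below is a minimal toy sketch, not the authors' implementation. It relies on the standard fact that a logistic classifier trained on balanced samples from two densities estimates their log-ratio via its logit, and it builds the waymarks as Gaussians with interpolated means, a toy stand-in for the waymark constructions a real application would need.

```python
# Toy comparison of single-ratio vs. telescoped density-ratio estimation
# via classification. Illustrative sketch only; all names are ours.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

def log_ratio_estimator(xs_num, xs_den):
    """Fit a logistic classifier on balanced samples from two densities;
    its logit (decision_function) then estimates log(p_num / p_den)."""
    X = np.concatenate([xs_num, xs_den])
    y = np.concatenate([np.ones(len(xs_num)), np.zeros(len(xs_den))])
    return LogisticRegression(max_iter=1000).fit(X, y).decision_function

# p = N(8, 1) and q = N(0, 1): KL(p || q) = 32 nats, so the two sample
# sets barely overlap and a single classifier is nearly separable.
p_samples = rng.normal(8.0, 1.0, size=(n, 1))
q_samples = rng.normal(0.0, 1.0, size=(n, 1))
single = log_ratio_estimator(p_samples, q_samples)

# Telescoped version: bridge p and q with intermediate Gaussians
# p_0 = p, ..., p_4 = q (means 8, 6, 4, 2, 0), estimate each easy
# 2-nat ratio by classification, and sum the per-step logits.
means = np.linspace(8.0, 0.0, num=5)
waymarks = [rng.normal(m, 1.0, size=(n, 1)) for m in means]
steps = [log_ratio_estimator(waymarks[k], waymarks[k + 1])
         for k in range(len(waymarks) - 1)]

x = np.array([[8.0]])  # true log-ratio at x = 8 is 8*8 - 32 = 32 nats
print("single-ratio:", single(x))                      # unreliable here
print("telescoped:  ", sum(step(x) for step in steps))
```

In this regime the single classifier sees almost no overlap between the two sample sets, so its fitted logit is shaped more by regularisation than by the data, while each telescoped step compares well-overlapping neighbours and remains well-behaved.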
