Fixing a Broken ELBO

Recent work in unsupervised representation learning has focused on learning deep directed latent-variable models. Fitting these models by maximizing the marginal likelihood (evidence) is typically intractable, so a common approximation is to maximize the evidence lower bound (ELBO) instead. However, maximum likelihood training (whether exact or approximate) does not necessarily yield a good latent representation, as we demonstrate both theoretically and empirically. In particular, we derive variational lower and upper bounds on the mutual information between the input and the latent variable, and use these bounds to derive a rate-distortion curve that characterizes the tradeoff between compression and reconstruction accuracy. Using this framework, we demonstrate that there is a family of models with identical ELBO values but different quantitative and qualitative characteristics. Our framework also suggests a simple new method to ensure that latent-variable models with powerful stochastic decoders do not ignore their latent code.
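
To make the compression-reconstruction tradeoff concrete, the following is an illustrative sketch of the rate-distortion decomposition the abstract refers to, written in standard VAE notation introduced here (an encoder q(z|x), a decoder p(x|z), a variational marginal m(z), and the data distribution p^*(x) with entropy H); the symbol names are our choice rather than quotations from the paper:

    R = \mathbb{E}_{p^*(x)}\,\mathbb{E}_{q(z|x)}\big[\log q(z|x) - \log m(z)\big]          (rate)
    D = -\,\mathbb{E}_{p^*(x)}\,\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]                   (distortion)
    H - D \le I(X;Z) \le R, \qquad -\mathrm{ELBO} = D + R.

Because the ELBO depends only on the sum D + R, a model can trade rate for distortion along a line of constant D + R without changing its ELBO, which is the sense in which models with identical ELBO values can behave very differently. Minimizing a weighted objective D + \beta R (or constraining R to a target value) is one way to select a point on the rate-distortion curve with nonzero rate, in the spirit of the remedy the abstract alludes to for decoders that would otherwise ignore the latent code.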
