MIM: Mutual Information Machine

We introduce the Mutual Information Machine (MIM), a probabilistic auto-encoder for learning joint distributions over observations and latent variables. MIM reflects three design principles: 1) low divergence, to encourage the encoder and decoder to learn consistent factorizations of the same underlying distribution; 2) high mutual information, to encourage an informative relation between data and latent variables; and 3) low marginal entropy, or compression, which tends to encourage clustered latent representations. We show that a combination of the Jensen-Shannon divergence and the joint entropy of the encoding and decoding distributions satisfies these criteria, and admits a tractable cross-entropy bound that can be optimized directly with Monte Carlo sampling and stochastic gradient descent. We contrast MIM learning with maximum likelihood and VAEs. Experiments show that MIM learns representations with high mutual information, consistent encoding and decoding distributions, effective latent clustering, and data log-likelihoods comparable to VAEs, while avoiding posterior collapse.
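
To make the structure of the objective concrete, the following is a minimal sketch of how the three criteria combine and how the cross-entropy bound arises. The notation (encoding joint q_θ(x,z), decoding joint p_θ(x,z), mixture M_S, loss L_MIM) is our shorthand for exposition and is not quoted verbatim from the paper; the anchors P(x) and P(z) denote the data distribution and the prior.

```latex
% Encoding and decoding joint distributions over observations x and latents z,
% anchored by the data distribution P(x) and the prior P(z):
%   q_\theta(x,z) = q_\theta(z \mid x)\,P(x), \qquad
%   p_\theta(x,z) = p_\theta(x \mid z)\,P(z),
% with equally weighted mixture M_S = \tfrac{1}{2}\,(q_\theta + p_\theta).

% Identity: the Jensen-Shannon divergence plus the average joint entropy of the
% encoding and decoding distributions equals the entropy of their mixture,
% so the three design criteria collapse into a single quantity:
\mathrm{JSD}\!\left(q_\theta \,\|\, p_\theta\right)
  + \tfrac{1}{2}\bigl(H_{q_\theta}(x,z) + H_{p_\theta}(x,z)\bigr)
  = H_{M_S}(x,z).

% Bound: concavity of the log gives
% -\log\tfrac{q+p}{2} \le -\tfrac{1}{2}\bigl(\log q + \log p\bigr),
% so the mixture entropy is upper bounded by a tractable cross-entropy-style
% loss, estimated with Monte Carlo samples from M_S and minimized by SGD:
H_{M_S}(x,z) \;\le\;
\mathcal{L}_{\mathrm{MIM}}(\theta)
  = -\tfrac{1}{2}\,\mathbb{E}_{(x,z)\sim M_S}
      \Bigl[\log q_\theta(z \mid x)P(x) + \log p_\theta(x \mid z)P(z)\Bigr].
```

Under this reading, minimizing the bound simultaneously pushes the encoder and decoder joints toward one another (low divergence) while compressing the mixture, which in turn rewards high mutual information between x and z; the exact parameterization of the marginals and the anchored sampling distribution follow the paper's construction rather than this simplified sketch.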
