MINE: Mutual Information Neural Estimation

This paper presents a Mutual Information Neural Estimator (MINE) that scales linearly in both dimensionality and sample size. MINE is trainable through back-propagation, and we prove that it is strongly consistent. We illustrate a handful of applications in which MINE is successfully applied to improve the properties of generative models in both unsupervised and supervised settings. We also apply our framework to estimate the information bottleneck and use it in supervised classification tasks. Our results demonstrate substantial added flexibility and improvement in these settings.
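
For concreteness, the sketch below (not the authors' reference implementation) shows how such an estimator is typically trained: a small statistics network T_theta is optimized by gradient ascent on the Donsker-Varadhan lower bound I(X;Z) >= E_P[T_theta] - log E_{P x P}[exp(T_theta)], with marginal samples obtained by shuffling z within the batch. The names (StatisticsNetwork, mine_lower_bound), architecture, and hyperparameters are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """Small MLP T_theta(x, z); name and architecture are illustrative."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, z], dim=1)).squeeze(-1)

def mine_lower_bound(T: nn.Module, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan bound: E_P[T] - log E_{PxP}[exp(T)].

    Joint samples are the paired (x_i, z_i); marginal samples come
    from shuffling z within the batch.
    """
    z_marginal = z[torch.randperm(z.size(0))]
    t_joint = T(x, z).mean()
    t_marginal = T(x, z_marginal)
    # log-mean-exp over the batch approximates log E[exp(T)]
    log_mean_exp = torch.logsumexp(t_marginal, dim=0) - math.log(t_marginal.size(0))
    return t_joint - log_mean_exp

# Usage: maximize the bound by gradient ascent (minimize its negative).
if __name__ == "__main__":
    x = torch.randn(256, 2)
    z = x + 0.5 * torch.randn(256, 2)  # correlated toy data
    T = StatisticsNetwork(x_dim=2, z_dim=2)
    opt = torch.optim.Adam(T.parameters(), lr=1e-3)
    for step in range(500):
        opt.zero_grad()
        loss = -mine_lower_bound(T, x, z)
        loss.backward()
        opt.step()
    print(f"estimated MI lower bound: {-loss.item():.3f} nats")
```

Note that this naive estimator has a biased gradient through the log-partition term; the paper mitigates this with an exponential moving average of that term, a refinement the sketch above omits.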
