Estimating Total Correlation with Mutual Information Estimators

Total correlation (TC) is a fundamental concept in information theory that measures the statistical dependency among multiple random variables. Recently, TC has shown notable effectiveness as a regularizer in many learning tasks, where the correlation among multiple latent embeddings must be jointly minimized or maximized. However, calculating precise TC values is challenging, especially when the closed-form distributions of the embedding variables are unknown. In this paper, we introduce a unified framework for estimating TC values with sample-based mutual information (MI) estimators. More specifically, we establish a relation between TC and MI and propose two types of calculation paths (tree-like and line-like) that decompose TC into MI terms. With each MI term bounded by a sample-based estimator, the TC value can be estimated. Further, we provide theoretical analyses of the statistical consistency of the proposed TC estimators. Experiments are presented on both synthetic and real-world scenarios, where our estimators demonstrate effectiveness across TC estimation, minimization, and maximization tasks. The code is available at https://github.com/Linear95/TC-estimation.
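
To make the decomposition concrete: by the chain rule of mutual information, the line-like path reads TC(X_1, …, X_d) = Σ_{i=2}^d I(X_i; X_1, …, X_{i-1}), while the tree-like path splits the variable group in half and recurses, TC(X) = TC(X_L) + TC(X_R) + I(X_L; X_R). The sketch below is a minimal illustration of both paths, not the authors' implementation (see the linked repository for that): each MI term is computed with a simple Gaussian plug-in estimator in place of the neural sample-based MI bounds the paper plugs in, and Gaussian data is used because its true TC has a closed form for checking.

```python
import numpy as np

def gaussian_mi(cov, idx_a, idx_b):
    # MI between two blocks of a jointly Gaussian vector:
    # I(A; B) = 0.5 * (logdet S_AA + logdet S_BB - logdet S_(A,B)).
    sub = lambda idx: cov[np.ix_(idx, idx)]
    return 0.5 * (np.linalg.slogdet(sub(idx_a))[1]
                  + np.linalg.slogdet(sub(idx_b))[1]
                  - np.linalg.slogdet(sub(idx_a + idx_b))[1])

def tc_line_like(cov):
    # Line-like path: TC(X_1..X_d) = sum_{i=2}^d I(X_i ; X_1..X_{i-1}).
    d = cov.shape[0]
    return sum(gaussian_mi(cov, list(range(i)), [i]) for i in range(1, d))

def tc_tree_like(cov, idx=None):
    # Tree-like path: split the variable group in half and recurse:
    # TC(X) = TC(X_L) + TC(X_R) + I(X_L ; X_R).
    idx = list(range(cov.shape[0])) if idx is None else idx
    if len(idx) < 2:
        return 0.0
    left, right = idx[:len(idx) // 2], idx[len(idx) // 2:]
    return (tc_tree_like(cov, left) + tc_tree_like(cov, right)
            + gaussian_mi(cov, left, right))

# Sanity check on Gaussian samples, where TC has the closed form
# 0.5 * (sum_i log Sigma_ii - logdet Sigma).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
sigma = A @ A.T + 4.0 * np.eye(4)      # a random SPD covariance
x = rng.multivariate_normal(np.zeros(4), sigma, size=50_000)
cov_hat = np.cov(x, rowvar=False)

true_tc = 0.5 * (np.log(np.diag(sigma)).sum() - np.linalg.slogdet(sigma)[1])
print(f"line-like: {tc_line_like(cov_hat):.4f}")
print(f"tree-like: {tc_tree_like(cov_hat):.4f}")
print(f"true TC  : {true_tc:.4f}")
```

On 50k samples both paths recover the closed-form TC up to sampling error. In the learned-embedding setting the paper targets, gaussian_mi would be replaced by a trained sample-based MI bound (e.g., a MINE- or CLUB-style estimator) evaluated per term, yielding a lower or upper bound on TC depending on the direction of the chosen MI bound.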
