Measuring Dependence with Matrix-based Entropy Functional

Measuring the dependence among data plays a central role in statistics and machine learning. In this work, we summarize and generalize the main idea behind existing information-theoretic dependence measures into a higher-level perspective via Shearer's inequality. Building on this generalization, we propose two measures, the matrix-based normalized total correlation and the matrix-based normalized dual total correlation, which quantify the dependence of multiple variables in spaces of arbitrary dimension without explicitly estimating the underlying data distributions. We show that our measures are differentiable and statistically more powerful than prevalent alternatives. We also demonstrate their utility, advantages, and implications in four machine learning problems: gene regulatory network inference, robust learning under covariate shift and non-Gaussian noise, subspace outlier detection, and understanding the learning dynamics of convolutional neural networks.
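To make the construction concrete, the sketch below illustrates the kind of quantity such matrix-based measures build on: a matrix-based Rényi α-order entropy computed from the eigenvalues of a unit-trace Gram matrix, a joint entropy obtained from the Hadamard product of the per-variable Gram matrices, and a total-correlation-style score (sum of marginal entropies minus the joint entropy). The Gaussian kernel, the bandwidth sigma, the choice α = 1.01, and the omission of the paper's normalization are illustrative assumptions of this sketch, not the authors' reference implementation.

```python
import numpy as np

def gram_matrix(x, sigma=1.0):
    """Unit-trace normalized Gaussian Gram matrix of the samples in x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-sq_dists / (2.0 * sigma ** 2))
    d = np.sqrt(np.diag(k))
    a = k / np.outer(d, d)      # normalize so the diagonal equals 1
    return a / np.trace(a)      # unit trace: eigenvalues sum to 1

def matrix_entropy(a, alpha=1.01):
    """Matrix-based Renyi alpha-order entropy S_alpha(A) from eigenvalues."""
    eigvals = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    eigvals = eigvals / eigvals.sum()
    return np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha)

def total_correlation(views, alpha=1.01, sigma=1.0):
    """Sum of marginal entropies minus the joint (Hadamard-product) entropy."""
    grams = [gram_matrix(v, sigma) for v in views]
    joint = grams[0]
    for g in grams[1:]:
        joint = joint * g               # element-wise (Hadamard) product
    joint = joint / np.trace(joint)     # renormalize to unit trace
    marginals = sum(matrix_entropy(g, alpha) for g in grams)
    return marginals - matrix_entropy(joint, alpha)

# Example: the first two views share a latent source, the third is independent.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
views = [z, z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))]
print(total_correlation(views))
```

The normalized total correlation and normalized dual total correlation proposed in the work rescale quantities of this kind so that the scores are comparable across variable counts and dimensions; the exact normalizers and the dual variant are defined in the paper and omitted from this sketch.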
