Measuring Dependence Powerfully and Equitably

Given a high-dimensional data set, we often wish to find the strongest relationships within it. A common strategy is to evaluate a measure of dependence on every variable pair and retain the highest-scoring pairs for follow-up. This strategy works well if the statistic used is equitable [Reshef et al. 2015a], i.e., if, for some measure of noise, it assigns similar scores to equally noisy relationships regardless of relationship type (e.g., linear, exponential, periodic). In this paper, we introduce and characterize a population measure of dependence called MIC*. We show that MIC* can be viewed in three ways: as the population value of MIC, a highly equitable statistic from [Reshef et al. 2011]; as a canonical "smoothing" of mutual information; and as the supremum of an infinite sequence defined in terms of optimal one-dimensional partitions of the marginals of the joint distribution. Based on this theory, we introduce an efficient approach for computing MIC* from the density of a pair of random variables, and we define a new, efficiently computable, consistent estimator MICe of MIC*. In contrast, there is no known polynomial-time algorithm for computing the original equitable statistic MIC. We show through simulations that MICe has better bias-variance properties than MIC. We then introduce a second statistic, TICe, which is a trivial byproduct of computing MICe and whose goal is powerful independence testing rather than equitability, and we prove its consistency. We show in simulations that MICe and TICe have good equitability and power against independence, respectively. These analyses complement a more in-depth empirical evaluation of several leading measures of dependence [Reshef et al. 2015b], which shows state-of-the-art performance for MICe and TICe.
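To make the grid-based idea concrete, below is a minimal illustrative sketch in Python/NumPy. It is not the authors' algorithm: MICe equipartitions only one axis and optimizes the partition of the other (via dynamic programming), whereas this sketch equipartitions both axes as a crude stand-in. It scans every grid resolution (k, l) with k*l at most n^0.6, computes normalized mutual information on each grid, and reads off the maximum (a MIC-like score) and the sum (a TIC-like score). All function names here are hypothetical.

```python
import numpy as np
from itertools import product

def equipartition(v, k):
    """Rank-based assignment of each value in v to one of k roughly equal-mass bins."""
    ranks = np.argsort(np.argsort(v))          # 0..n-1 ranks
    return np.minimum(ranks * k // len(v), k - 1)

def grid_mi(bx, by, k, l):
    """Empirical mutual information (in bits) of the binned pair on a k-by-l grid."""
    joint = np.zeros((k, l))
    np.add.at(joint, (bx, by), 1)              # unbuffered 2-D histogram
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)      # row marginal
    py = joint.sum(axis=0, keepdims=True)      # column marginal
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz])))

def mic_tic_sketch(x, y, exponent=0.6):
    """Scan all resolutions (k, l) with k*l <= n**exponent. The max normalized-MI
    entry roughly approximates MIC; the sum of entries roughly approximates TIC.
    (Simplification: both axes are equipartitioned, unlike true MICe.)"""
    n = len(x)
    B = n ** exponent
    mic = tic = 0.0
    for k, l in product(range(2, int(B) + 1), repeat=2):
        if k * l > B:
            continue
        entry = grid_mi(equipartition(x, k), equipartition(y, l), k, l) / np.log2(min(k, l))
        mic = max(mic, entry)                  # normalized to [0, 1]
        tic += entry
    return mic, tic

# Example usage: a noisy periodic relationship scores well above an independent pair.
rng = np.random.default_rng(0)
x = rng.uniform(size=500)
y = np.sin(10 * x) + 0.3 * rng.normal(size=500)
print(mic_tic_sketch(x, y))                    # higher scores indicate stronger dependence
```

Note how the TIC-like sum falls out of the same scan that produces the MIC-like maximum; this is the sense in which TICe is a trivial byproduct of computing MICe.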

References

[1] David N. Reshef et al. Equitability, interval estimation, and statistical power. Statistical Science, 2015.

[2] T. Speed. A Correlation for the 21st Century. Science, 2011.

[3] John D. Storey et al. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 2003.

[4] Imre Csiszár et al. Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory, 2004.

[5] Maria L. Rizzo et al. Brownian distance covariance. arXiv:1010.0297, 2009.

[6] Bernhard Schölkopf et al. The Randomized Dependence Coefficient. NIPS, 2013.

[7] Bernhard Schölkopf et al. Measuring Statistical Dependence with Hilbert-Schmidt Norms. ALT, 2005.

[8] Malka Gorfine et al. Comment on "Detecting Novel Associations in Large Data Sets". 2012.

[9] M. Roulston. Estimating the errors on measured entropy and mutual information. 1999.

[10] J. Friedman et al. Estimating Optimal Transformations for Multiple Regression and Correlation. 1985.

[11] H. Stefánsson et al. Genetics of gene expression and its effect on disease. Nature, 2008.

[12] A. Rényi. On measures of dependence. 1959.

[13] Michael Mitzenmacher et al. Detecting Novel Associations in Large Data Sets. Science, 2011.

[14] C. J. Stone. Consistent Nonparametric Regression. 1977.

[15] Liam Paninski. Estimation of Entropy and Mutual Information. Neural Computation, 2003.

[16] Maria L. Rizzo et al. Measuring and testing dependence by correlation of distances. arXiv:0803.4101, 2007.

[17] Bo Jiang et al. Nonparametric K-Sample Tests via Dynamic Slicing. 2015.

[18] Daniel S. Murrell et al. R2-equitability is satisfiable. Proceedings of the National Academy of Sciences, 2014.

[19] Yi Li et al. Copula Correlation: An Equitable Dependence Measure and Extension of Pearson's Correlation. arXiv:1312.7214, 2013.

[20] Bernhard Schölkopf et al. A Kernel Two-Sample Test. Journal of Machine Learning Research, 2012.

[21] Imre Csiszár et al. Axiomatic Characterizations of Information Measures. Entropy, 2008.

[22] Malka Gorfine et al. Consistent Distribution-Free K-Sample and Independence Tests for Univariate Random Variables. Journal of Machine Learning Research, 2014.

[23] Michael Mitzenmacher et al. An Empirical Study of Leading Measures of Dependence. arXiv, 2015.

[24] J. Kinney et al. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 2013.

[25] W. Hoeffding. A Non-Parametric Test of Independence. 1948.

[26] Michael Mitzenmacher et al. Cleaning up the record on the maximal information coefficient and equitability. Proceedings of the National Academy of Sciences, 2014.

[27] Le Song et al. A Kernel Statistical Test of Independence. NIPS, 2007.

[28] R. Heller et al. A consistent multivariate test of association based on ranks of distances. arXiv:1201.3522, 2012.

[29] Arnold Neumaier. Introduction to Numerical Analysis. 2001.

[30] R. Tibshirani et al. Comment on "Detecting Novel Associations in Large Data Sets" by Reshef et al., Science, Dec 16, 2011. arXiv:1401.7645, 2014.

[31] E. H. Linfoot. An Informational Measure of Correlation. Information and Control, 1957.

[32] W. Cleveland et al. Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. 1988.

[33] Eli Upfal et al. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. 2005.

[34] Thomas M. Cover et al. Elements of Information Theory. 2005.

[35] A. Kraskov et al. Estimating mutual information. Physical Review E, 2003.