Discovering General Multidimensional Associations

When two variables are related by a known function, the coefficient of determination (denoted R²) measures the proportion of the total variance in the observations explained by that function. For linear relationships, this is equal to the square of the correlation coefficient, ρ. When the parametric form of the relationship is unknown, however, it is unclear how to estimate the proportion of explained variance equitably, that is, in a way that assigns similar values to equally noisy relationships. Here we demonstrate how to directly estimate a generalised R² when the form of the relationship is unknown, and we consider the performance of the Maximal Information Coefficient (MIC), a recently proposed information-theoretic measure of dependence. We show that our approach behaves equitably, has more power than MIC to detect association between variables, and converges faster with increasing sample size. Most importantly, our approach generalises to higher dimensions, estimating the strength of multivariate relationships (Y against A, B, …) as well as measuring association while controlling for covariates (Y against X controlling for C). An R package named matie (“Measuring Association and Testing Independence Efficiently”) is available (http://cran.r-project.org/web/packages/matie/).
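The linear case mentioned above can be checked numerically. The following is a minimal sketch in plain Python, not part of the matie package and not the generalised estimator proposed here: it fits an ordinary least-squares line to synthetic data and compares the explained-variance ratio with the squared Pearson correlation. The function names, the simulated slope, and the noise level are all illustrative assumptions.

```python
import math
import random


def r_squared_linear(x, y):
    """Coefficient of determination of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Synthetic noisy linear relationship (assumed parameters, fixed seed).
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(500)]
y = [2.0 * a + rng.gauss(0.0, 0.5) for a in x]

r2 = r_squared_linear(x, y)
rho = pearson_r(x, y)
print(f"R^2 = {r2:.6f}, rho^2 = {rho ** 2:.6f}")
```

For a least-squares line the two printed values agree up to floating-point error. The identity relies on knowing that the relationship is linear, which is exactly the assumption that is unavailable when the functional form is unknown.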
