A Statistically and Numerically Efficient Independence Test Based on Random Projections and Distance Covariance

Testing for independence plays a fundamental role in many statistical techniques. Among nonparametric approaches, distance-based methods (such as hypothesis tests of independence based on distance correlation) have many advantages over the alternatives. A known limitation of distance-based methods is their computational cost: for a sample of size n, a method that computes all pairwise distances typically has O(n^2) computational complexity. Recent work has shown that in the univariate case a fast method with O(n log n) computational complexity and O(n) memory requirement exists. In this paper, we introduce a test of independence based on random projection and distance correlation, which achieves nearly the same power as the state-of-the-art distance-based approach, works in the multivariate case, and has O(nK log n) computational complexity and O(max{n, K}) memory requirement, where K is the number of random projections. Note that a saving over the O(n^2) cost is achieved when K < n / log n. We name our method Randomly Projected Distance Covariance (RPDC). The statistical theoretical analysis takes advantage of techniques for random projections that are rooted in contemporary machine learning. Numerical experiments demonstrate the efficiency of the proposed method relative to numerous competitors.
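The idea of averaging univariate distance covariances over random projections can be illustrated with a minimal sketch. It is not the paper's RPDC implementation, and it rests on several assumptions: the function names (`dcov_sq_1d`, `projected_dcov`, and the helpers) are hypothetical, the univariate statistic is the plain V-statistic computed naively in O(n^2) rather than by the fast O(n log n) method, projection directions are drawn uniformly on the unit sphere, and any normalizing constants used by the actual method are omitted.

```python
import math
import random


def centered_distances(values):
    """Double-centered matrix of pairwise |v_i - v_j| (naive O(n^2))."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row = [sum(r) / n for r in d]  # d is symmetric, so column means equal row means
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def dcov_sq_1d(u, v):
    """Squared sample distance covariance (V-statistic) of two 1-D samples."""
    n = len(u)
    a = centered_distances(u)
    b = centered_distances(v)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def random_unit_vector(dim, rng):
    """Direction drawn uniformly on the unit sphere via normalized Gaussians."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def project(rows, direction):
    """Project each row (a point in R^dim) onto the given direction."""
    return [sum(r * d for r, d in zip(row, direction)) for row in rows]


def projected_dcov(x, y, k=20, seed=0):
    """Average univariate dcov^2 over k random projection pairs.

    Assumption: illustrative only; the paper's constants and the
    O(n log n) univariate routine are not reproduced here.
    """
    rng = random.Random(seed)
    p, q = len(x[0]), len(y[0])
    total = 0.0
    for _ in range(k):
        u = random_unit_vector(p, rng)
        v = random_unit_vector(q, rng)
        total += dcov_sq_1d(project(x, u), project(y, v))
    return total / k
```

Replacing `dcov_sq_1d` with an O(n log n) univariate routine would give the O(nK log n) cost quoted above; with the naive routine used here, the sketch costs O(n^2 K) and serves only to show the structure of the statistic.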
