A new framework for distance and kernel-based metrics in high dimensions

The paper presents new metrics to quantify and test for (i) the equality of distributions and (ii) the independence between two high-dimensional random vectors. We show that the energy distance based on the usual Euclidean distance cannot completely characterize the homogeneity of two high-dimensional distributions: in the high-dimensional setup, it detects only the equality of means and of the traces of the covariance matrices. We propose a new class of metrics that, in the low-dimensional setting, inherits the desirable properties of the energy distance and maximum mean discrepancy (for homogeneity) and of the (generalized) distance covariance and the Hilbert-Schmidt Independence Criterion (for independence). In the high-dimensional setup, the new metrics are capable of detecting the homogeneity of, and completely characterizing independence between, the low-dimensional marginal distributions. We further propose t-tests based on the new metrics to perform high-dimensional two-sample testing and independence testing, and study their asymptotic behavior under both high dimension low sample size (HDLSS) and high dimension medium sample size (HDMSS) setups. The computational complexity of the t-tests grows only linearly with the dimension, so the tests scale to very high-dimensional data. We demonstrate the superior power behavior of the proposed tests for homogeneity of distributions and independence on both simulated and real datasets.
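
For reference, the Euclidean energy distance discussed above is the population quantity 2E|X - Y| - E|X - X'| - E|Y - Y'|, where X' and Y' are independent copies of X and Y. The following is a minimal sketch of its plug-in (V-statistic) sample estimate, not the metrics or t-tests proposed in the paper; the function names `pairwise_euclidean` and `energy_distance` are illustrative assumptions.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Return the matrix of Euclidean distances between rows of a and rows of b."""
    diff = a[:, None, :] - b[None, :, :]          # shape (n, m, p)
    return np.sqrt((diff ** 2).sum(axis=-1))      # shape (n, m)


def energy_distance(x, y):
    """Plug-in (V-statistic) estimate of 2E|X-Y| - E|X-X'| - E|Y-Y'|.

    x has shape (n, p) and y has shape (m, p); rows are observations.
    """
    dxy = pairwise_euclidean(x, y).mean()   # between-sample mean distance
    dxx = pairwise_euclidean(x, x).mean()   # within-sample mean distance (x)
    dyy = pairwise_euclidean(y, y).mean()   # within-sample mean distance (y)
    return 2.0 * dxy - dxx - dyy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(50, 200))
    y = rng.normal(loc=0.5, size=(50, 200))  # mean-shifted sample
    print(energy_distance(x, y))
```

Each pairwise distance costs O(p), so this estimate costs O((n + m)^2 p) overall. A mean shift, as in the usage example, is the kind of difference the abstract says the Euclidean energy distance can still detect in high dimensions.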
