Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

Nonparametric two-sample testing is a decision-theoretic problem: deciding whether two random variables differ in distribution without making parametric assumptions about the underlying distributions. We refer to the two most common settings as mean difference alternatives (MDA), in which only differences in first moments are tested, and general difference alternatives (GDA), in which any difference between the distributions is tested. Many test statistics have been proposed for both settings. This paper connects three classes of statistics: high-dimensional variants of Hotelling's t-test, statistics based on reproducing kernel Hilbert spaces, and energy statistics based on pairwise distances. We ask: how much statistical power do popular kernel and distance based tests for GDA have when the unknown distributions differ in their means, compared to specialized tests for MDA? We formally characterize the power of popular GDA tests, such as the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED), in the high-dimensional MDA regime. Practically important findings include: (a) eED and gMMD have asymptotically equal power; furthermore, they enjoy a free lunch: while additionally being consistent for GDA, they have the same power under MDA as specialized high-dimensional t-test variants. All these tests are asymptotically optimal (including matching constants) under MDA for spherical covariances, according to simple lower bounds. (b) The power of gMMD is independent of the kernel bandwidth, as long as the bandwidth is larger than the choice made by the median heuristic. (c) There is a clear and smooth computation-statistics tradeoff among linear-time, subquadratic-time, and quadratic-time versions of these tests, with more computation resulting in higher power.
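
To make the statistics named in the abstract concrete, the following is a minimal NumPy sketch of a quadratic-time gMMD estimate, a linear-time gMMD estimate, the Euclidean energy distance, and the median heuristic. It is an illustration under stated assumptions, not the paper's implementation: the kernel parameterization exp(-||x - y||^2 / (2 * bandwidth^2)), the U-statistic normalizations, the permutation calibration, and all function names are choices made here for the example.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between the rows of a and the rows of b."""
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off

def median_heuristic(x, y):
    """Median pairwise distance of the pooled sample (one common convention)."""
    z = np.vstack([x, y])
    d2 = pairwise_sq_dists(z, z)
    upper = np.triu_indices_from(d2, k=1)
    return float(np.sqrt(np.median(d2[upper])))

def gmmd2_quadratic(x, y, bandwidth):
    """Quadratic-time estimate of squared MMD with a Gaussian kernel."""
    n, m = len(x), len(y)
    scale = 2.0 * bandwidth ** 2
    kxx = np.exp(-pairwise_sq_dists(x, x) / scale)
    kyy = np.exp(-pairwise_sq_dists(y, y) / scale)
    kxy = np.exp(-pairwise_sq_dists(x, y) / scale)
    np.fill_diagonal(kxx, 0.0)  # drop i == j terms (U-statistic)
    np.fill_diagonal(kyy, 0.0)
    return float(kxx.sum() / (n * (n - 1)) + kyy.sum() / (m * (m - 1)) - 2.0 * kxy.mean())

def gmmd2_linear(x, y, bandwidth):
    """Linear-time estimate: average the kernel contrast over disjoint pairs."""
    k = (min(len(x), len(y)) // 2) * 2  # use an even number of points from each sample
    x1, x2, y1, y2 = x[0:k:2], x[1:k:2], y[0:k:2], y[1:k:2]
    scale = 2.0 * bandwidth ** 2

    def kern(u, v):
        return np.exp(-((u - v) ** 2).sum(axis=1) / scale)

    h = kern(x1, x2) + kern(y1, y2) - kern(x1, y2) - kern(x2, y1)
    return float(h.mean())

def energy_distance(x, y):
    """Energy distance with the Euclidean norm, in U-statistic form."""
    n, m = len(x), len(y)
    dxx = np.sqrt(pairwise_sq_dists(x, x))
    dyy = np.sqrt(pairwise_sq_dists(y, y))
    dxy = np.sqrt(pairwise_sq_dists(x, y))
    np.fill_diagonal(dxx, 0.0)
    np.fill_diagonal(dyy, 0.0)
    return float(2.0 * dxy.mean() - dxx.sum() / (n * (n - 1)) - dyy.sum() / (m * (m - 1)))

def permutation_pvalue(x, y, statistic, n_perm=200, seed=0):
    """Calibrate a two-sample statistic by permuting the pooled sample labels."""
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    n = len(x)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(z))
        if statistic(z[idx[:n]], z[idx[n:]]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

A typical use would draw two samples that differ only in their means, set `bandwidth = median_heuristic(x, y)`, and compare `permutation_pvalue(x, y, lambda a, b: gmmd2_quadratic(a, b, bandwidth))` with `permutation_pvalue(x, y, energy_distance)`. The linear-time variant touches each point once, so it is cheaper but noisier, which is the kind of tradeoff the abstract's item (c) refers to.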
