A general asymptotic framework for distribution‐free graph‐based two‐sample tests

Testing equality of two multivariate distributions is a classical problem for which many non‐parametric tests have been proposed over the years. Most of the popular asymptotically distribution‐free two‐sample tests are based either on geometric graphs constructed from interpoint distances between the observations (multivariate generalizations of the Wald–Wolfowitz runs test) or on multivariate data depth (generalizations of the Mann–Whitney rank test). The paper introduces a general notion of distribution‐free graph‐based two‐sample tests and provides a unified framework for analysing and comparing their asymptotic properties. The asymptotic (Pitman) efficiency of a general graph‐based test is derived. The result covers tests based on geometric graphs, such as the Friedman–Rafsky test, the test based on the K‐nearest‐neighbour graph, the cross‐match test and the generalized edge count test, as well as tests based on multivariate depth functions (the Liu–Singh rank sum statistic). The results show how the combinatorial properties of the underlying graph affect the performance of the associated two‐sample test, and they can be used to validate tests and to decide which to use in practice. Applications of the results are illustrated on both synthetic and real data sets.
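
To make the idea of a graph‐based two‐sample test concrete, the following is a minimal sketch of an edge‐count test in the spirit of the Friedman–Rafsky test: build a minimum spanning tree on the pooled sample, count the edges that join points from different samples, and reject equality when that count is unusually small. The function names, the use of Euclidean distance and the permutation calibration are illustrative assumptions, not the paper's code; the paper itself studies the asymptotic behaviour of such statistics rather than a particular implementation.

```python
import math
import random


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns the n - 1 edges (i, j) of a minimum spanning tree.
    Assumption: Euclidean distance; any dissimilarity could be used instead.
    """
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Update the cheapest connection for every remaining vertex.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_edge_count(edges, labels):
    """Number of edges whose endpoints carry different sample labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


def edge_count_permutation_test(x, y, n_perm=999, seed=0):
    """Edge-count statistic on the pooled-sample MST with a permutation p-value.

    Few cross-sample edges suggest the two samples occupy different regions,
    so the p-value is the share of label permutations with a count at most
    as large as the observed one.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    labels = [0] * len(x) + [1] * len(y)
    edges = mst_edges(pooled)  # the graph depends only on the pooled sample
    observed = cross_edge_count(edges, labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if cross_edge_count(edges, labels) <= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)


if __name__ == "__main__":
    sample_x = [(0, 0), (0, 1), (1, 0), (1, 1)]
    sample_y = [(10, 10), (10, 11), (11, 10), (11, 11)]
    print(edge_count_permutation_test(sample_x, sample_y))
```

Because the tree is built from the pooled sample without reference to the labels, relabelling the points leaves the graph unchanged, which is what makes a permutation calibration natural here. Replacing `mst_edges` with a K‐nearest‐neighbour or matching construction would give analogues of the other graph‐based tests named above.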
