Graph-theoretic multisample tests of equality in distribution for high dimensional data

Testing whether two or more independent samples arise from a common distribution is a classic problem in statistics. Several multivariate two-sample tests of equality are based on graphs such as the minimum spanning tree, nearest-neighbor graphs, and the optimal nonbipartite perfect matching. In these tests, the samples are pooled and the test statistic is the number of edges in the graph that connect points with different sample identities. These tests are typically unbiased and perform well when estimates of the underlying probability densities are poor. However, they have not been thoroughly studied when the data are very high dimensional or in the multisample case. We introduce the use of orthogonal perfect matchings for testing equality in distribution. A suite of Monte Carlo simulations on artificial and real data shows that orthogonal perfect matchings and spanning trees typically have higher power than other graphs, and that they are more effective at detecting differences in covariance structure between samples than other nonparametric tests such as the energy and triangle tests.

We test whether two or more samples have arisen from the same multivariate density. The test uses new graphs as well as the minimum spanning tree or nearest neighbors. The test performs well and is easy to carry out for large d, small n datasets. The power of the new tests is competitive with or exceeds that of other general multivariate tests. The mean and variance of the asymptotically normal null distribution are easy to compute.
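To make the edge-count statistic concrete, the following is a minimal sketch, not the authors' implementation. It assumes Euclidean distance, uses the minimum spanning tree as the graph, and calibrates the statistic with a label-permutation null rather than the asymptotically normal null mentioned above. The function names (`mst_edges`, `cross_edge_count`, `permutation_test`) are hypothetical and chosen only for illustration.

```python
import math
import random


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph of the pooled points.

    Returns the n - 1 tree edges as (i, j) index pairs.
    """
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    best_from = [-1] * n         # tree vertex giving that connection
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Update the cheapest connection for the remaining vertices.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_edge_count(edges, labels):
    """Number of edges joining points with different sample identities."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


def permutation_test(samples, n_perm=999, seed=0):
    """Pool two or more samples and test equality in distribution.

    Few cross-sample edges is evidence against equality, so the p-value
    is the share of label permutations with a count at most the observed one.
    (Assumption: permutation null, not the asymptotic normal approximation.)
    """
    rng = random.Random(seed)
    points = [p for sample in samples for p in sample]
    labels = [k for k, sample in enumerate(samples) for _ in sample]
    edges = mst_edges(points)            # the graph ignores the labels
    observed = cross_edge_count(edges, labels)
    shuffled = labels[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if cross_edge_count(edges, shuffled) <= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value
```

Because the graph is built on the pooled points without reference to the labels, it needs to be computed only once; each permutation just recounts cross-sample edges. Passing three or more samples to `permutation_test` gives a multisample version of the same count.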
