On Wasserstein Two-Sample Testing and Related Families of Nonparametric Tests

Nonparametric two-sample or homogeneity testing is a decision-theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. The literature is old and rich, with a wide variety of statistics having been intelligently designed and analyzed for both the unidimensional and the multivariate settings. Our contribution is to tie together many of these tests, drawing connections between seemingly very different statistics. Our central object is the Wasserstein distance, around which we form a chain of connections from univariate methods such as the Kolmogorov-Smirnov test, PP/QQ plots and ROC/ODC curves, to multivariate tests involving energy statistics and kernel-based maximum mean discrepancy. Some connections proceed through the construction of a smoothed Wasserstein distance, and others through the pursuit of a "distribution-free" Wasserstein test. Some observations in this chain are implicit in the literature, while others seem not to have been noticed thus far. Given the classical and continued importance of nonparametric two-sample testing, we aim to provide useful connections for theorists and practitioners familiar with one subset of methods but not others.
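To make the univariate end of this chain concrete, the following is a minimal sketch, not taken from the paper, of two of the statistics named above computed from the same pair of samples. It assumes equal sample sizes, in which case the empirical p-Wasserstein distance reduces to an average over matched order statistics. The function names `wasserstein_1d` and `ks_statistic` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def wasserstein_1d(x, y, p=1):
    """Empirical p-Wasserstein distance between two univariate samples.

    Assumption of this sketch: both samples have the same size n, so the
    optimal coupling matches the i-th smallest x to the i-th smallest y.
    """
    xs, ys = sorted(x), sorted(y)
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    mean_cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return mean_cost ** (1.0 / p)


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: sup_t |F_n(t) - G_m(t)|."""
    xs, ys = sorted(x), sorted(y)
    # The supremum of the gap between two step functions is attained
    # at one of the pooled sample points.
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in xs + ys
    )


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0]
    y = [3.0, 1.0, 2.0]
    print(wasserstein_1d(x, y, p=1), wasserstein_1d(x, y, p=2), ks_statistic(x, y))
```

Both statistics compare the two samples through their empirical distributions: the Wasserstein distance compares quantile functions, while the Kolmogorov-Smirnov statistic compares cumulative distribution functions. The sketch only computes the statistics; calibrating a test (for example by permutation) is a separate step not shown here.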
