A Sequential Non-Parametric Multivariate Two-Sample Test

Given samples from two distributions, a non-parametric two-sample test aims to determine whether the two distributions are equal, based on a test statistic. Classically, this statistic is computed either on the whole data set or on a subset of the data set by a function trained on its complement. We consider a third class of methods, designed to handle large (possibly infinite) data sets and to automatically determine the most relevant scales to work at, and make two contributions. First, we develop a generic sequential non-parametric testing framework in which the sample size need not be fixed in advance. This makes our test a truly sequential non-parametric multivariate two-sample test. Under information-theoretic conditions qualifying the difference between the tested distributions, we establish consistency of the two-sample test. Second, we instantiate our framework using nearest neighbor regressors and show how the power of the resulting two-sample test can be improved using Bayesian mixtures and switch distributions. This combination of techniques yields automatic scale selection, and experiments on challenging data sets show that our sequential tests perform comparably to state-of-the-art non-sequential tests.
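
To make the sequential idea concrete, the following is a minimal sketch of one standard way to build such a test, and not necessarily the construction used in the paper. It assumes each observation arrives with a label (0 or 1) naming the sample it came from, and that labels are fair coin flips. Under the null hypothesis the label is then independent of the point, so a predictor that bets on the label from past data produces a test martingale, and Ville's inequality bounds the probability of ever crossing 1/alpha by alpha. All names here (`knn_label_prob`, `sequential_two_sample_test`, `make_stream`, the grid `ks`) are hypothetical, and a uniform Bayesian mixture over a few values of k stands in for scale selection; switch distributions are not implemented.

```python
import math
import random


def knn_label_prob(history, x, k):
    """Estimate P(label = 1 | x) from the k nearest past points.

    Laplace smoothing keeps the estimate strictly inside (0, 1),
    so the log-wealth update below is always finite.
    """
    if not history:
        return 0.5
    nearest = sorted(history, key=lambda h: math.dist(h[0], x))[:k]
    ones = sum(label for _, label in nearest)
    return (ones + 1.0) / (len(nearest) + 2.0)


def sequential_two_sample_test(stream, ks=(1, 5, 25), alpha=0.05):
    """Process (x, label) pairs one at a time; stop as soon as H0 is rejected.

    Returns (rejected, number_of_points_used).
    """
    history = []
    log_wealth = [0.0] * len(ks)  # one log test martingale per value of k
    log_threshold = math.log(1.0 / alpha)
    n = 0
    for x, label in stream:
        n += 1
        for i, k in enumerate(ks):
            p1 = knn_label_prob(history, x, k)  # uses past data only
            p = p1 if label == 1 else 1.0 - p1
            # Likelihood ratio against the null prediction 1/2.
            log_wealth[i] += math.log(p / 0.5)
        history.append((x, label))
        # A uniform mixture of martingales is again a martingale.
        m = max(log_wealth)
        log_mix = m + math.log(sum(math.exp(w - m) for w in log_wealth) / len(ks))
        if log_mix >= log_threshold:
            return True, n
    return False, n


def make_stream(n, shift, seed):
    """Toy data: 2-D Gaussians whose first coordinate differs in mean by `shift`."""
    rng = random.Random(seed)
    for _ in range(n):
        label = rng.randint(0, 1)
        x = (rng.gauss(shift * label, 1.0), rng.gauss(0.0, 1.0))
        yield x, label


if __name__ == "__main__":
    print(sequential_two_sample_test(make_stream(300, shift=3.0, seed=0)))
    print(sequential_two_sample_test(make_stream(300, shift=0.0, seed=0)))
```

Because the stopping rule is checked after every observation, the sample size is determined by the data rather than fixed in advance. The mixture over `ks` lets the test follow whichever neighborhood size predicts best, at the cost of at most a factor `len(ks)` in wealth.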
