Global and local two-sample tests via regression

Two-sample testing is a fundamental problem in statistics. Despite its long history, the problem has attracted renewed interest with the advent of high-dimensional and complex data. In the machine learning literature in particular, recent methodological developments include classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests, which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically, under some well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness.
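To make the idea of a regression-based two-sample test concrete, the following is a minimal sketch under stated assumptions, not the procedure of this work. It pools the two samples, labels each point by its sample of origin, estimates the regression function m(x) = P(label = 1 | x) by k-nearest neighbors, and measures how far the estimate deviates from the overall class proportion. The choice of statistic, the k-NN regressor, the permutation calibration, and all function names (`nearest_neighbors`, `pointwise_gaps`, `regression_two_sample_test`) are illustrative assumptions. The pointwise gaps hint at how local differences might be examined, but this sketch attaches no significance to them.

```python
import random

def nearest_neighbors(points, k):
    """For each pooled point, return the indices of its k nearest points (itself included)."""
    neighbors = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
        order = sorted(range(len(points)), key=dists.__getitem__)
        neighbors.append(order[:k])
    return neighbors

def pointwise_gaps(labels, neighbors):
    """k-NN estimate of P(label = 1 | x) minus the class proportion, at each point."""
    pi_hat = sum(labels) / len(labels)
    return [sum(labels[j] for j in idx) / len(idx) - pi_hat for idx in neighbors]

def global_statistic(labels, neighbors):
    """Mean squared pointwise gap (an assumed, illustrative global statistic)."""
    gaps = pointwise_gaps(labels, neighbors)
    return sum(g * g for g in gaps) / len(gaps)

def regression_two_sample_test(sample_a, sample_b, k=10, n_perm=200, seed=0):
    """Return (observed statistic, permutation p-value) for H0: same distribution."""
    rng = random.Random(seed)
    points = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    # Neighbor sets depend only on the points, so they are reused for every permutation.
    neighbors = nearest_neighbors(points, k)
    observed = global_statistic(labels, neighbors)
    exceed = 0
    for _ in range(n_perm):
        permuted = labels[:]
        rng.shuffle(permuted)  # under H0 the labels are exchangeable
        if global_statistic(permuted, neighbors) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Under the null hypothesis the labels carry no information about x, so the estimated regression function should stay close to the class proportion everywhere. Swapping the k-NN step for another regression method changes which kinds of structure the test can detect.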
