A label efficient two-sample test

Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or of two different distributions (the alternative hypothesis). We consider a new setting for this problem in which sample features are easily measured, whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework for performing an effective two-sample test with only a small number of label queries: first, a classifier is trained on samples whose labels are queried uniformly at random, to model the posterior probabilities of the labels; second, a novel query scheme dubbed bimodal query is used to query the labels of samples from both classes; and third, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments on several datasets demonstrate that the proposed test controls the Type I error and has a lower Type II error than uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at https://github.com/wayne0908/Label-Efficient-Two-Sample.
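The three stages can be illustrated with a minimal Python sketch. This is not the released implementation: the function names, the `oracle` callable (a stand-in for costly label queries), the logistic-regression classifier, and the permutation-based null distribution are all assumptions made for illustration. In particular, bimodal query is interpreted here as querying the samples with the highest estimated posterior for each class, and the FR statistic is taken to be the number of minimum-spanning-tree edges joining samples with different labels, with small values counting as evidence against the null.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.linear_model import LogisticRegression


def fr_permutation_test(X, y, n_perm=500, rng=None):
    """FR-style test: count MST edges whose endpoints carry different labels.

    Assumes no duplicate points (zero distances are treated as missing edges
    by scipy's sparse MST routine). The p-value comes from permuting labels,
    an assumption of this sketch rather than the classical asymptotic test.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    rows, cols = mst.row, mst.col
    observed = int(np.sum(y[rows] != y[cols]))
    null = np.empty(n_perm, dtype=int)
    for i in range(n_perm):
        perm = rng.permutation(y)
        null[i] = np.sum(perm[rows] != perm[cols])
    # Few cross-label edges indicate that the two samples are separated.
    p_value = (1 + np.sum(null <= observed)) / (1 + n_perm)
    return observed, float(p_value)


def label_efficient_test(X, oracle, n_train, n_query, rng=None):
    """Hypothetical three-stage pipeline; `oracle(idx)` returns labels in {0, 1}."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # Stage 1: query labels uniformly at random and fit a posterior model.
    train_idx = rng.choice(n, size=n_train, replace=False)
    clf = LogisticRegression().fit(X[train_idx], oracle(train_idx))
    # Stage 2 (assumed reading of bimodal query): pick the remaining samples
    # with the most extreme estimated posteriors for each class.
    pool = np.setdiff1d(np.arange(n), train_idx)
    order = np.argsort(clf.predict_proba(X[pool])[:, 1])
    half = n_query // 2
    query_idx = np.concatenate([pool[order[:half]], pool[order[-half:]]])
    # Stage 3: run the FR-style test on the queried samples only.
    return fr_permutation_test(X[query_idx], oracle(query_idx), rng=rng)


# Synthetic usage: two Gaussian classes whose means differ by 2 per coordinate.
demo_rng = np.random.default_rng(0)
y_true = demo_rng.integers(0, 2, size=400)
X_demo = demo_rng.normal(size=(400, 2)) + 2.0 * y_true[:, None]
stat, p_value = label_efficient_test(
    X_demo, lambda idx: y_true[idx], n_train=40, n_query=40, rng=1
)
print(f"cross-label MST edges: {stat}, permutation p-value: {p_value:.3f}")
```

Because the classifier is trained on a disjoint set of labeled samples and the query rule depends only on features, the labels of the queried samples remain exchangeable under the null in this sketch, which is what makes the permutation p-value valid here.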
