Asymptotically Optimal One- and Two-Sample Testing With Kernels

We characterize the asymptotic performance of nonparametric one- and two-sample testing. The exponential decay rate or error exponent of the type-II error probability is used as the asymptotic performance metric, and an optimal test achieves the maximum rate subject to a constant level constraint on the type-I error probability. With Sanov's theorem, we derive a sufficient condition for one-sample tests to achieve the optimal error exponent in the universal setting, i.e., for any distribution defining the alternative hypothesis. We then show that two classes of Maximum Mean Discrepancy (MMD)-based tests attain the optimal type-II error exponent on $\mathbb R^d$, while the quadratic-time Kernel Stein Discrepancy (KSD)-based tests achieve this optimality with an asymptotic level constraint. For general two-sample testing, however, Sanov's theorem is insufficient to obtain a similar sufficient condition. We proceed to establish an extended version of Sanov's theorem and derive an exact error exponent for the quadratic-time MMD-based two-sample tests. The obtained error exponent is further shown to be optimal among all two-sample tests satisfying a given level constraint. Our results not only solve a long-standing open problem in information theory and statistics, but also provide an achievability result for optimal nonparametric one- and two-sample testing. Applications to off-line change detection and related issues are also discussed.
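To make the quadratic-time MMD statistic mentioned above concrete, the following is a minimal sketch of the standard biased (V-statistic) estimate of the squared MMD between two samples. The Gaussian kernel, the bandwidth value, the function names, and the synthetic data are all assumptions chosen for illustration; the abstract does not specify a kernel, a bandwidth, or a rejection threshold, so none of these should be read as the authors' configuration.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a and rows of b (assumed kernel)."""
    # Pairwise squared Euclidean distances, clipped at 0 to absorb rounding error.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased quadratic-time estimate of the squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    # Squared RKHS distance between the two empirical mean embeddings.
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Synthetic illustration: one pair of samples from the same distribution,
# one pair from distributions that differ by a mean shift.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y_same = rng.normal(0.0, 1.0, size=(200, 2))
y_shift = rng.normal(1.0, 1.0, size=(200, 2))

stat_same = mmd2_biased(x, y_same)
stat_shift = mmd2_biased(x, y_shift)
print(f"same distribution:    {stat_same:.4f}")
print(f"shifted distribution: {stat_shift:.4f}")
```

A two-sample test would reject the null hypothesis of equal distributions when this statistic exceeds a threshold chosen to meet the level constraint; how that threshold is set is left out here because the abstract does not describe it.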
