A Fast, Consistent Kernel Two-Sample Test

A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. When this distance is used as the statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the null distribution of the empirical statistic (the case P = Q) is an infinite weighted sum of χ² random variables. Prior finite-sample approximations to the null distribution include bootstrap resampling, which yields a consistent estimate but is computationally costly, and fitting a parametric model to the low-order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high-dimensional multivariate data, and text.
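
The idea of estimating the null distribution from the Gram matrix eigenspectrum can be illustrated with a minimal Python sketch. Several details are assumptions of the sketch rather than statements from the abstract: the Gaussian kernel and its bandwidth `sigma`, equal sample sizes m, the biased statistic scaled by m, the division of the centred Gram eigenvalues by 2m, and the factor 2 on the squared standard normal draws. The function names `gaussian_gram` and `mmd_spectral_test` are hypothetical.

```python
import numpy as np


def gaussian_gram(a, b, sigma):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def mmd_spectral_test(x, y, sigma=1.0, n_null=2000, alpha=0.05, seed=0):
    """Two-sample test with a null threshold drawn from the Gram eigenspectrum.

    Assumes x and y have the same number of rows m (an assumption of this sketch).
    Returns (statistic, threshold, reject).
    """
    m = x.shape[0]
    if y.shape[0] != m:
        raise ValueError("this sketch assumes equal sample sizes")

    # Gram matrix on the aggregate sample from P and Q.
    z = np.vstack([x, y])
    K = gaussian_gram(z, z, sigma)

    # Biased squared-distance statistic between the two embeddings, scaled by m.
    stat = m * (K[:m, :m].mean() + K[m:, m:].mean() - 2.0 * K[:m, m:].mean())

    # Centre the Gram matrix and take its eigenvalues (assumed scaling: 1 / (2m)).
    H = np.eye(2 * m) - 1.0 / (2 * m)
    eigs = np.linalg.eigvalsh(H @ K @ H)
    eigs = np.clip(eigs, 0.0, None) / (2 * m)

    # Draw from the weighted sum of chi-squared variables: 2 * sum_l eig_l * g_l^2.
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_null, eigs.size))
    null_samples = 2.0 * (g**2) @ eigs

    threshold = np.quantile(null_samples, 1.0 - alpha)
    return stat, threshold, stat > threshold
```

A possible usage, with synthetic data chosen only for illustration: draw `x` from a standard normal and `y` from a normal with shifted mean, then call `stat, threshold, reject = mmd_spectral_test(x, y)`. A full eigendecomposition is used here for simplicity; keeping only the leading eigenvalues is one possible way to reduce the cost.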
