Hilbert Space Embeddings and Metrics on Probability Measures

A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk, indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk. First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains) and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation invariant on ℝ^d, then it is characteristic if and only if the support of its Fourier transform is the entire ℝ^d. Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk, we relate γk to other popular metrics on probability measures, and present conditions on the kernel k under which γk metrizes the weak topology.
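For intuition, the following is a minimal sketch (not taken from the paper) of how γk is commonly estimated from two samples: with a characteristic kernel such as the Gaussian RBF kernel, which is integrally strictly positive definite, the squared distance between the empirical mean embeddings reduces to averages of pairwise kernel evaluations. The function names `gaussian_kernel` and `mmd_squared`, the bandwidth `sigma`, and the toy data are illustrative assumptions, not part of the original work.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd_squared(X, Y, sigma=1.0):
    """Biased (V-statistic) empirical estimate of γk(P, Q)^2 from samples X ~ P, Y ~ Q."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

# Illustrative usage: two Gaussian samples whose means differ.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(0.5, 1.0, size=(500, 2))
print(mmd_squared(X, Y, sigma=1.0))  # strictly positive when the distributions differ
```

Because the Gaussian kernel is characteristic, this estimate converges to zero only when the two underlying distributions coincide; with a non-characteristic kernel the population quantity can be zero for distinct distributions.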
