Minimax Estimation of Kernel Mean Embeddings

In this paper, we study the minimax estimation of the Bochner integral $$\mu_k(P):=\int_{\mathcal{X}} k(\cdot,x)\,dP(x),$$ also called the kernel mean embedding, based on random samples drawn i.i.d. from $P$, where $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ is a positive definite kernel. Various estimators $\hat{\theta}_n$ of $\mu_k(P)$, including the empirical estimator, have been studied in the literature, all of which satisfy $\bigl\| \hat{\theta}_n-\mu_k(P)\bigr\|_{\mathcal{H}_k}=O_P(n^{-1/2})$, with $\mathcal{H}_k$ being the reproducing kernel Hilbert space induced by $k$. The main contribution of the paper is to show that the above-mentioned rate of $n^{-1/2}$ is minimax in the $\|\cdot\|_{\mathcal{H}_k}$ and $\|\cdot\|_{L^2(\mathbb{R}^d)}$ norms over the class of discrete measures and the class of measures that have an infinitely differentiable density, with $k$ being a continuous translation-invariant kernel on $\mathbb{R}^d$. The interesting aspect of this result is that the minimax rate is independent of the smoothness of the kernel and of the density of $P$ (if it exists). This result has practical consequences in statistical applications, as the mean embedding has been widely employed in non-parametric hypothesis testing, density estimation, causal inference and feature selection, through its relation to energy distance (and distance covariance).
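To make the $n^{-1/2}$ rate of the empirical estimator $\hat{\theta}_n=\frac{1}{n}\sum_{i=1}^n k(\cdot,X_i)$ concrete, the following is a minimal numerical sketch and not the paper's construction. It assumes a one-dimensional Gaussian kernel with unit bandwidth and $P=N(0,1)$, both chosen here only because $\mu_k(P)$ and $\|\mu_k(P)\|^2_{\mathcal{H}_k}$ then have closed forms; all function names are hypothetical. Expanding the squared norm with the reproducing property gives $\|\hat{\theta}_n-\mu_k(P)\|^2_{\mathcal{H}_k}=\frac{1}{n^2}\sum_{i,j}k(X_i,X_j)-\frac{2}{n}\sum_i \mu_k(P)(X_i)+\|\mu_k(P)\|^2_{\mathcal{H}_k}$, whose expectation is $\bigl(\mathbb{E}\,k(X,X)-\|\mu_k(P)\|^2_{\mathcal{H}_k}\bigr)/n$.

```python
import math
import random

# Assumption for illustration: Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 * SIGMA2))
# on the real line, and P = N(0, 1). Neither choice comes from the paper.
SIGMA2 = 1.0


def k(x, y):
    """Gaussian kernel with squared bandwidth SIGMA2."""
    return math.exp(-((x - y) ** 2) / (2.0 * SIGMA2))


def mean_embedding_at(x):
    """Closed form of mu_k(P)(x) = E_{Y ~ N(0,1)} k(x, Y)."""
    scale = math.sqrt(SIGMA2 / (SIGMA2 + 1.0))
    return scale * math.exp(-x * x / (2.0 * (SIGMA2 + 1.0)))


# Closed form of ||mu_k(P)||^2 = E k(Y, Y') for independent Y, Y' ~ N(0, 1).
MU_NORM2 = math.sqrt(SIGMA2 / (SIGMA2 + 2.0))


def squared_rkhs_error(xs):
    """||theta_hat_n - mu_k(P)||^2 in the RKHS, via the reproducing property."""
    n = len(xs)
    gram_term = sum(k(a, b) for a in xs for b in xs) / (n * n)
    cross_term = sum(mean_embedding_at(a) for a in xs) / n
    return gram_term - 2.0 * cross_term + MU_NORM2


def scaled_risk(n, reps, rng):
    """Monte Carlo estimate of n * E ||theta_hat_n - mu_k(P)||^2."""
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += squared_rkhs_error(xs)
    return n * total / reps


if __name__ == "__main__":
    rng = random.Random(0)
    # Theoretical value of n * E||.||^2 is E k(X, X) - ||mu_k(P)||^2 = 1 - MU_NORM2.
    print("theory:", 1.0 - MU_NORM2)
    for n in (25, 50, 100):
        print(n, scaled_risk(n, reps=100, rng=rng))
```

If the sketch behaves as intended, the printed scaled risks should stay roughly constant as $n$ grows, which is the $n^{-1/2}$ rate in norm. This only illustrates the upper bound for one estimator; the minimax lower bound is a statement over all estimators and is not something a simulation of this kind can verify.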
