Kernel Mean Shrinkage Estimators

A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is central to kernel methods: it is used by many classical algorithms such as kernel principal component analysis, and it forms the core inference step of modern kernel methods that rely on embedding probability distributions in RKHSs. Given a finite sample, the empirical average is commonly used as the standard estimator of the true kernel mean. Despite the widespread use of this estimator, we show that it can be improved by exploiting the well-known Stein phenomenon. We propose a new family of estimators called kernel mean shrinkage estimators (KMSEs), which enjoy both theoretical justification and good empirical performance. The results demonstrate that the proposed estimators outperform the standard one, especially in the "large d, small n" regime.
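
To make the objects concrete, the following Python sketch contrasts the standard empirical kernel mean with a simple shrinkage variant that pulls the estimate toward the zero function in the RKHS. It is illustrative only, not the paper's code: the Gaussian kernel, the zero-function shrinkage target, and the fixed value of alpha are assumptions made here for the example; in practice the shrinkage parameter would be chosen in a data-dependent way rather than fixed by hand.

import numpy as np

def gaussian_gram(X, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def shrinkage_weights(n, alpha):
    """RKHS elements are represented by weights w over the sample,
    f = sum_i w_i k(x_i, .).  The empirical kernel mean uses w_i = 1/n;
    shrinking toward the zero function scales these weights by (1 - alpha)."""
    w_empirical = np.full(n, 1.0 / n)
    return (1.0 - alpha) * w_empirical

def rkhs_sq_dist(K, u, v):
    """Squared RKHS distance between two weighted sums of feature maps:
    ||sum_i u_i k(x_i,.) - sum_i v_i k(x_i,.)||^2 = (u - v)^T K (u - v)."""
    d = u - v
    return float(d @ K @ d)

# Toy usage: compare the empirical weights with the shrunken weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))          # n = 50 samples in d = 5 dimensions
K = gaussian_gram(X)
w_emp = np.full(X.shape[0], 1.0 / X.shape[0])
w_shr = shrinkage_weights(X.shape[0], alpha=0.1)   # alpha fixed for illustration
print(rkhs_sq_dist(K, w_emp, w_shr))  # RKHS distance induced by the shrinkage

The design point of the sketch is that both estimators live in the span of the feature maps of the sample, so they can be manipulated entirely through weight vectors and the Gram matrix, without ever forming the (possibly infinite-dimensional) feature maps explicitly.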
