Sample-Efficient Kernel Mean Estimator with Marginalized Corrupted Data

Estimating the kernel mean in a reproducing kernel Hilbert space is central to many kernel-based learning algorithms. Given a finite sample, the empirical average is the standard estimator of the target kernel mean. Prior work has shown that shrinkage methods can yield better estimators. In this work, we propose corrupting data examples with noise from known distributions and present a new estimator, the marginalized kernel mean estimator, which estimates the kernel mean under the corrupted distribution. Theoretically, we show that the marginalized kernel mean estimator introduces implicit regularization into kernel mean estimation. Empirically, on a variety of tasks, we show that it is sample-efficient and obtains much lower estimation errors than existing estimators.
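
To make the idea concrete, the following is a minimal sketch of one special case, not the paper's estimator in its general form. It assumes a Gaussian RBF kernel with bandwidth `sigma` and additive isotropic Gaussian corruption with standard deviation `noise_std`; both choices, and all function names, are illustrative assumptions. Under these assumptions the expectation of the kernel over the corruption has a closed form, so the corrupted copies never need to be sampled: the result is an RBF kernel with widened bandwidth, scaled by a factor below one.

```python
import numpy as np


def rbf_kernel(X, Y, sigma):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def empirical_kernel_mean(X, Y, sigma):
    """Standard estimator: average of k(x_i, y) over the sample, at each y."""
    return rbf_kernel(X, Y, sigma).mean(axis=0)


def marginalized_kernel_mean(X, Y, sigma, noise_std):
    """Kernel mean under x_i + eps, eps ~ N(0, noise_std^2 I), in closed form.

    Assumption (illustrative): RBF kernel plus Gaussian corruption, for which
    E[k(x + eps, y)] = (sigma^2 / (sigma^2 + s^2))^(d/2)
                       * exp(-||x - y||^2 / (2 (sigma^2 + s^2))).
    """
    d = X.shape[1]
    total_var = sigma**2 + noise_std**2
    scale = (sigma**2 / total_var) ** (d / 2.0)
    widened = rbf_kernel(X, Y, np.sqrt(total_var))
    return scale * widened.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))      # observed sample
    Y = rng.normal(size=(5, 2))       # evaluation points
    print(empirical_kernel_mean(X, Y, sigma=1.0))
    print(marginalized_kernel_mean(X, Y, sigma=1.0, noise_std=0.5))
```

In this special case the scale factor shrinks the estimate toward zero and the wider bandwidth smooths it, which gives one way to picture the implicit regularization described above. Other kernels or corruption distributions would need their own expectation, computed analytically or approximated.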
