Two-stage sampled learning theory on distributions

We focus on the distribution regression problem: regressing to a real-valued response from a probability distribution. Although many similarity measures between distributions exist, very little is known about their generalization performance in concrete learning tasks. Learning problems formulated on distributions carry an inherent two-stage sampled difficulty: in practice, only samples from the sampled distributions are observable, and estimates must be built from similarities computed between these sets of points. To the best of our knowledge, the only existing method with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which suffers from slow convergence in high dimensions) and restricts the domain of the distributions to compact Euclidean sets. In this paper, we provide theoretical guarantees for a remarkably simple algorithmic alternative: embed the distributions into a reproducing kernel Hilbert space, and learn a ridge regressor from the embeddings to the outputs. Our main contribution is a proof that this technique is consistent in the two-stage sampled setting under mild conditions (on separable, topological domains endowed with kernels). For a given total number of observations, we derive convergence rates as an explicit function of the problem difficulty. As a special case, we answer a 15-year-old open question by establishing the consistency of the classical set kernel [Haussler, 1999; Gärtner et al., 2002] in regression; our results also cover more recent kernels on distributions, including those of [Christmann and Steinwart, 2010].
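The two-stage procedure described above is simple enough to sketch in a few lines of code. The following NumPy illustration is a minimal sketch, not the authors' implementation: it assumes a Gaussian kernel on R^d, represents each distribution by its bag of samples, builds the set-kernel Gram matrix (equivalently, inner products of empirical mean embeddings), and fits a kernel ridge regressor on top. All function names (gauss_kernel, set_kernel, fit, predict), the toy data, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(X, Y, gamma=0.5):
    # Gaussian kernel matrix exp(-gamma * ||x - y||^2) between point sets X (n,d), Y (m,d)
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def set_kernel(bags_a, bags_b, gamma=0.5):
    # K[i, j] = mean of k(x, y) over all cross-pairs of bag i and bag j:
    # the set kernel, i.e. the inner product of the empirical mean embeddings
    K = np.empty((len(bags_a), len(bags_b)))
    for i, A in enumerate(bags_a):
        for j, B in enumerate(bags_b):
            K[i, j] = gauss_kernel(A, B, gamma).mean()
    return K

def fit(bags, y, lam=1e-3, gamma=0.5):
    # kernel ridge regression on the embedded bags: alpha = (K + l*lam*I)^{-1} y
    K = set_kernel(bags, bags, gamma)
    return np.linalg.solve(K + len(bags) * lam * np.eye(len(bags)), y)

def predict(train_bags, alpha, test_bags, gamma=0.5):
    return set_kernel(test_bags, train_bags, gamma) @ alpha

# toy two-stage sampled data: bag i holds 30 draws from N(mu_i, I),
# and the real-valued response is the first coordinate of mu_i
rng = np.random.default_rng(0)
mus = rng.uniform(-2.0, 2.0, size=(60, 2))
bags = [mu + rng.normal(size=(30, 2)) for mu in mus]
y = mus[:, 0]

alpha = fit(bags[:50], y[:50])
print(np.abs(predict(bags[:50], alpha, bags[50:]) - y[50:]).mean())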

[1] Thomas Gärtner et al. Multi-Instance Kernels, 2002, ICML.

[2] Jun Gao et al. Identifying Multi-instance Outliers, 2010, SDM.

[3] Barnabás Póczos et al. Support Distribution Machines, 2012, arXiv.

[4] Nitakshi Goyal et al. General Topology-I, 2017.

[5] Arthur Gretton et al. Learning Theory for Distribution Regression, 2014, J. Mach. Learn. Res.

[6] Tobias Scheffer et al. International Conference on Machine Learning (ICML-99), 1999, Künstliche Intell.

[7] O. van Gaans. Probability measures on metric spaces, 2022.

[8] Frederick Eberhardt et al. Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

[9] Barnabás Póczos et al. k-NN Regression on Functional Data with Incomplete Observations, 2014, UAI.

[10] Le Song et al. Kernel Embeddings of Conditional Distributions: A unified kernel framework for nonparametric inference in graphical models, 2013, IEEE Signal Process. Mag.

[11] Zoltán Szabó et al. Information theoretical estimators toolbox, 2014, J. Mach. Learn. Res.

[12] Kenji Fukumizu et al. Universality, Characteristic Kernels and RKHS Embedding of Measures, 2010, J. Mach. Learn. Res.

[13] R. Gillan. New Editor-in-Chief for Journal of Physics A: Mathematical and Theoretical, 2014.

[14] A. Caponnetto et al. Optimal Rates for the Regularized Least-Squares Algorithm, 2007, Found. Comput. Math.

[15] David Beymer et al. Closed-Form Jensen-Renyi Divergence for Mixture of Gaussians and Applications to Group-Wise Shape Registration, 2009, MICCAI.

[16] Stergios B. Fotopoulos et al. All of Nonparametric Statistics, 2007, Technometrics.

[17] Jun Wang et al. Solving the Multiple-Instance Problem: A Lazy Learning Approach, 2000, ICML.

[18] Bernhard Schölkopf et al. Hilbert Space Embeddings and Metrics on Probability Measures, 2009, J. Mach. Learn. Res.

[19] Zhi-Hua Zhou et al. Multi-instance clustering with applications to multi-instance prediction, 2009, Applied Intelligence.

[20] Martin J. Wainwright et al. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates, 2013, J. Mach. Learn. Res.

[21] Bharath K. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies, 2013, arXiv:1310.8240.

[22] Andreas Christmann et al. Universal Kernels on Non-Standard Input Spaces, 2010, NIPS.

[23] Bernhard Schölkopf et al. A Kernel Two-Sample Test, 2012, J. Mach. Learn. Res.

[24] Adam Krzyzak et al. A Distribution-Free Theory of Nonparametric Regression, 2002, Springer Series in Statistics.

[25] Daniel Rueckert et al. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part II, 2017, Lecture Notes in Computer Science.

[26] Felipe Cucker et al. On the mathematical foundations of learning, 2001.

[27] Samory Kpotufe et al. k-NN Regression Adapts to Local Intrinsic Dimension, 2011, NIPS.

[28] Tony Jebara et al. A Kernel Between Sets of Vectors, 2003, ICML.

[29] Andreas Christmann et al. Support vector machines, 2008, Data Mining and Knowledge Discovery Handbook.

[30] Barnabás Póczos et al. Consistent, Two-Stage Sampled Distribution Regression via Mean Embedding, 2014, arXiv.

[31] Barnabás Póczos et al. Distribution-Free Distribution Regression, 2013, AISTATS.

[32] Kathryn B. Laskey et al. Uncertainty in Artificial Intelligence 15, 1999.

[33] Nello Cristianini et al. Kernel Methods for Pattern Analysis, 2004.

[34] Barnabás Póczos et al. Nonparametric Divergence Estimation with Applications to Machine Learning on Distributions, 2011, UAI.

[35] G. Loukidis et al. SIAM International Conference on Data Mining (SDM), 2015.

[36] Frank Nielsen et al. A closed-form expression for the Sharma–Mittal entropy of exponential families, 2011, arXiv.

[37] Ying Chen et al. Contextual Hausdorff dissimilarity for multi-instance clustering, 2012, 9th International Conference on Fuzzy Systems and Knowledge Discovery.

[38] Anthony Widjaja et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, 2003, IEEE Transactions on Neural Networks.

[39] Alexander J. Smola et al. Unifying Divergence Minimization and Statistical Inference Via Convex Duality, 2006, COLT.

[40] Rama Chellappa et al. From sample similarity to ensemble similarity: probabilistic distance measures in reproducing kernel Hilbert space, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41] Michael I. Jordan et al. Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces, 2004, J. Mach. Learn. Res.

[42] A. Berlinet et al. Reproducing kernel Hilbert spaces in probability and statistics, 2004.

[43] Thomas G. Dietterich et al. Solving the Multiple Instance Problem with Axis-Parallel Rectangles, 1997, Artif. Intell.

[44] David Haussler. Convolution kernels on discrete structures, 1999.

[45] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study, 2013, Artif. Intell.

[46] Matthias Hein et al. Hilbertian Metrics and Positive Definite Kernels on Probability Measures, 2005, AISTATS.

[47] Sally A. Goldman et al. Multiple-Instance Learning of Real-Valued Data, 2001, J. Mach. Learn. Res.

[48] David Page et al. Multiple Instance Regression, 2001, ICML.

[49] Bernhard Schölkopf et al. Learning from Distributions via Support Measure Machines, 2012, NIPS.

[50] Slobodan Vucetic et al. Mixture Model for Multiple Instance Regression and Applications in Remote Sensing, 2012, IEEE Transactions on Geoscience and Remote Sensing.

[51] Barnabás Póczos et al. Distribution to Distribution Regression, 2013, ICML.

[52] M. Reed. Methods of Modern Mathematical Physics. I: Functional Analysis, 1972.

[53] Kenji Fukumizu et al. Semigroup Kernels on Measures, 2005, J. Mach. Learn. Res.

[54] C. Berg et al. Harmonic Analysis on Semigroups, 1984.

[55] Barnabás Póczos et al. Fast Distribution To Real Regression, 2013, AISTATS.

[56] Tony Jebara et al. Probability Product Kernels, 2004, J. Mach. Learn. Res.