Learning Theory for Vector-Valued Distribution Regression

We focus on the distribution regression problem (DRP): we regress from probability measures to Hilbert-space-valued outputs, where the input distributions are only available through samples (the 'two-stage sampled' setting). Several important statistical and machine learning problems can be phrased within this framework, including point estimation tasks without analytical solution (such as entropy estimation) and multi-instance learning. However, due to the two-stage sampled nature of the problem, the theoretical analysis becomes quite challenging: to the best of our knowledge, the only existing method with performance guarantees for the DRP task requires density estimation (which often performs poorly in practice) and distributions defined on a compact Euclidean domain. We present a simple, analytically tractable alternative: we embed the distributions into a reproducing kernel Hilbert space and perform ridge regression from the embedded distributions to the outputs. We prove that this scheme is consistent under mild conditions, and we construct explicit finite-sample bounds on its excess risk, holding with high probability, as a function of the sample sizes and the problem difficulty. In particular, we establish the consistency of set kernels in regression, resolving a 15-year-old open question, and we present new kernels on embedded distributions. The practical efficiency of the studied technique is illustrated on supervised entropy learning and aerosol prediction using multispectral satellite images.
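To make the embedding-plus-ridge-regression scheme concrete, below is a minimal sketch in Python/NumPy. The Gaussian kernel, the bandwidth and regularization values, and all function names are illustrative assumptions rather than the paper's exact construction; the only structural ingredient taken from the text is that the set kernel between two bags is the inner product of their empirical mean embeddings, and that regression is standard kernel ridge regression on top of it.

```python
# A minimal sketch of ridge regression on mean-embedded distributions
# (the two-stage sampled setting described above). The Gaussian kernel and
# all hyperparameter values are illustrative assumptions.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A (m x d) and B (n x d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def set_kernel(bag_i, bag_j, sigma=1.0):
    """Inner product of empirical mean embeddings: the mean of all pairwise
    kernel values between the two sample bags."""
    return gaussian_kernel(bag_i, bag_j, sigma).mean()

def fit_ridge(bags, y, lam=1e-3, sigma=1.0):
    """Solve (K + lam * n * I) alpha = y, where K is the set-kernel Gram
    matrix over the n training bags."""
    n = len(bags)
    K = np.array([[set_kernel(bi, bj, sigma) for bj in bags] for bi in bags])
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return alpha

def predict(new_bag, bags, alpha, sigma=1.0):
    """Ridge predictor: f(new_bag) = sum_i alpha_i * K(new_bag, bag_i)."""
    k = np.array([set_kernel(new_bag, b, sigma) for b in bags])
    return k @ alpha

# Toy version of the supervised entropy learning task: each bag is drawn
# from N(0, s^2), and the regression target is that Gaussian's entropy.
rng = np.random.default_rng(0)
sigmas = rng.uniform(0.5, 2.0, size=50)
bags = [rng.normal(0.0, s, size=(100, 1)) for s in sigmas]
y = 0.5 * np.log(2 * np.pi * np.e * sigmas**2)   # entropy of N(0, s^2)
alpha = fit_ridge(bags, y)
print(predict(rng.normal(0.0, 1.0, size=(100, 1)), bags, alpha))
```

Note that each bag is embedded only implicitly: the Gram matrix of inner products between empirical mean embeddings is all the ridge step needs, which is what keeps the approach analytically tractable despite the two-stage sampling.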
