Learning Theory for Distribution Regression

We focus on the distribution regression problem: regressing to vector-valued outputs from probability measures. Many important machine learning and statistical tasks fit into this framework, including multi-instance learning and point estimation problems without analytical solutions (such as hyperparameter or entropy estimation). Despite the large number of heuristics available in the literature, the inherent two-stage sampled nature of the problem makes theoretical analysis quite challenging: in practice only samples from sampled distributions are observable, and the estimates have to rely on similarities computed between sets of points. To the best of our knowledge, the only existing technique with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which often performs poorly in practice) and requires the domain of the distributions to be compact Euclidean. In this paper, we study a simple, analytically computable, ridge regression-based alternative for distribution regression, in which we embed the distributions into a reproducing kernel Hilbert space and learn the regressor from the embeddings to the outputs. Our main contribution is to prove that this scheme is consistent in the two-stage sampled setup under mild conditions (on separable topological domains enriched with kernels): we present an exact computational-statistical efficiency trade-off analysis showing that our estimator is able to match the one-stage sampled minimax optimal rate [Caponnetto and De Vito, 2007; Steinwart et al., 2009]. This result answers a 17-year-old open question, establishing the consistency of the classical set kernel [Haussler, 1999; Gärtner et al., 2002] in regression. We also cover consistency for more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010].
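The two-step scheme described above (embed each sampled distribution into an RKHS, then run ridge regression on the embeddings) can be illustrated with a minimal sketch. It is not the authors' implementation. The Gaussian base kernel, the linear kernel on the mean embeddings (which gives the set kernel), the bandwidth, the regularization scaling `l * lam`, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_gram(X, Y, bandwidth):
    """Pairwise Gaussian kernel values between the rows of X and the rows of Y."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def set_kernel(bag_a, bag_b, bandwidth=1.0):
    """Inner product of the empirical mean embeddings of two bags.

    This equals the average of the base kernel over all cross pairs of points.
    """
    return gaussian_gram(bag_a, bag_b, bandwidth).mean()


def fit(bags, y, lam=1e-3, bandwidth=1.0):
    """Kernel ridge regression on bags; returns the dual coefficients."""
    l = len(bags)
    K = np.array([[set_kernel(a, b, bandwidth) for b in bags] for a in bags])
    # Assumed scaling: the objective averages the squared loss over the l bags.
    return np.linalg.solve(K + l * lam * np.eye(l), y)


def predict(train_bags, alpha, test_bags, bandwidth=1.0):
    """Predict outputs for new bags from the fitted dual coefficients."""
    K_test = np.array([[set_kernel(t, b, bandwidth) for b in train_bags]
                       for t in test_bags])
    return K_test @ alpha


# Toy two-stage sampled data (hypothetical): each bag holds 30 draws from
# N(m, 1), and the regression target is the unobserved mean m.
rng = np.random.default_rng(0)
means = rng.uniform(-2.0, 2.0, size=20)
bags = [rng.normal(m, 1.0, size=(30, 1)) for m in means]

alpha = fit(bags, means)
preds = predict(bags, alpha, bags[:3])
```

The estimator only ever sees the bags, never the underlying distributions, which is the two-stage sampling the abstract refers to. Replacing the linear kernel on embeddings with a nonlinear one would change only `set_kernel`.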

[1]  Ingo Steinwart,et al.  Mercer’s Theorem on General Domains: On the Interaction between Measures, Kernels, and RKHSs , 2012 .

[2]  Tony Jebara,et al.  A Kernel Between Sets of Vectors , 2003, ICML.

[3]  Philippe Preux,et al.  Multiple Operator-valued Kernel Learning , 2012, NIPS.

[4]  Andreas Christmann,et al.  Support vector machines , 2008, Data Mining and Knowledge Discovery Handbook.

[5]  Adam Tauman Kalai,et al.  A Note on Learning from Multiple-Instance Examples , 2004, Machine Learning.

[6]  Martin J. Wainwright,et al.  Randomized sketches for kernels: Fast and optimal non-parametric regression , 2015, ArXiv.

[7]  Eric P. Xing,et al.  Nonextensive Information Theoretic Kernels on Measures , 2009, J. Mach. Learn. Res..

[8]  Larry Wasserman,et al.  All of Nonparametric Statistics (Springer Texts in Statistics) , 2006 .

[9]  A. Caponnetto,et al.  Optimal Rates for the Regularized Least-Squares Algorithm , 2007, Found. Comput. Math..

[10]  Zhi-Hua Zhou,et al.  Multi-instance clustering with applications to multi-instance prediction , 2009, Applied Intelligence.

[11]  Bastian Goldlücke,et al.  Variational Analysis , 2014, Computer Vision, A Reference Guide.

[12]  Michael K. Ng,et al.  Multi-Instance Dimensionality Reduction , 2010, AAAI.

[13]  David Beymer,et al.  Closed-Form Jensen-Renyi Divergence for Mixture of Gaussians and Applications to Group-Wise Shape Registration , 2009, MICCAI.

[14]  David Page,et al.  Multiple Instance Regression , 2001, ICML.

[15]  Zoltán Szabó,et al.  Information theoretical estimators toolbox , 2014, J. Mach. Learn. Res..

[16]  Concha Bielza,et al.  A survey on multi‐output regression , 2015, WIREs Data Mining Knowl. Discov..

[17]  H. Engl,et al.  Regularization of Inverse Problems , 1996 .

[18]  Alexander J. Smola,et al.  Unifying Divergence Minimization and Statistical Inference Via Convex Duality , 2006, COLT.

[19]  Rama Chellappa,et al.  From sample similarity to ensemble similarity: probabilistic distance measures in reproducing kernel Hilbert space , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Bernhard Schölkopf,et al.  Learning from Distributions via Support Measure Machines , 2012, NIPS.

[21]  Barnabás Póczos,et al.  Learning theory for distribution regression , 2016 .

[22]  Murat Dundar,et al.  Bayesian multiple instance learning: automatic feature selection and inductive transfer , 2008, ICML '08.

[23]  Luo Si,et al.  M3IC: Maximum Margin Multiple Instance Clustering , 2009, IJCAI.

[24]  Kellen Petersen  Real Analysis , 2009 .

[25]  Michael W. Mahoney,et al.  Fast Randomized Kernel Ridge Regression with Statistical Guarantees , 2015, NIPS.

[26]  Jun Wang,et al.  Solving the Multiple-Instance Problem: A Lazy Learning Approach , 2000, ICML.

[27]  Kenji Fukumizu,et al.  Universality, Characteristic Kernels and RKHS Embedding of Measures , 2010, J. Mach. Learn. Res..

[28]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2004 .

[29]  Jing Chai,et al.  Multiple-instance discriminant analysis , 2014, Pattern Recognit..

[30]  Boris Babenko,et al.  Multiple Instance Learning with Manifold Bags , 2011, ICML.

[31]  L. Wasserman All of Nonparametric Statistics , 2005 .

[32]  Bernhard Schölkopf,et al.  Hilbert Space Embeddings and Metrics on Probability Measures , 2009, J. Mach. Learn. Res..

[33]  C. Carmeli,et al.  Vector valued reproducing kernel Hilbert spaces and universality , 2008, 0807.1659.

[34]  A. Caponnetto Optimal Rates for Regularization Operators in Learning Theory , 2006 .

[35]  Jun Gao,et al.  Identifying Multi-instance Outliers , 2010, SDM.

[36]  Barnabás Póczos,et al.  Linear-Time Learning on Distributions with Approximate Kernel Embeddings , 2015, AAAI.

[37]  S. Smale,et al.  ESTIMATING THE APPROXIMATION ERROR IN LEARNING THEORY , 2003 .

[38]  Stéphane Canu,et al.  Operator-valued Kernels for Learning from Functional Response Data , 2015, J. Mach. Learn. Res..

[39]  Neil D. Lawrence,et al.  Kernels for Vector-Valued Functions: a Review , 2011, Found. Trends Mach. Learn..

[40]  Mykola Pechenizkiy,et al.  HyDR-MI: A hybrid algorithm to reduce dimensionality in multiple instance learning , 2013, Inf. Sci..

[41]  Lorenzo Rosasco,et al.  Less is More: Nyström Computational Regularization , 2015, NIPS.

[42]  Alexander J. Smola,et al.  Learning with Kernels: support vector machines, regularization, optimization, and beyond , 2001, Adaptive computation and machine learning series.

[43]  Nitakshi Goyal,et al.  General Topology-I , 2017 .

[44]  Charles A. Micchelli,et al.  Learning Multiple Tasks with Kernel Methods , 2005, J. Mach. Learn. Res..

[45]  Michael I. Jordan,et al.  Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces , 2004, J. Mach. Learn. Res..

[46]  A. Berlinet,et al.  Reproducing kernel Hilbert spaces in probability and statistics , 2004 .

[47]  Florence d'Alché-Buc,et al.  Semi-supervised Penalized Output Kernel Regression for Link Prediction , 2011, ICML.

[48]  Don R. Hush,et al.  Optimal Rates for Regularized Least Squares Regression , 2009, COLT.

[49]  Dan Zhang,et al.  MILEAGE: Multiple Instance LEArning with Global Embedding , 2013, ICML.

[50]  G. A. Edgar Measure, Topology, and Fractal Geometry , 1990 .

[51]  Barnabás Póczos,et al.  Distribution-Free Distribution Regression , 2013, AISTATS.

[52]  Adam Krzyzak,et al.  A Distribution-Free Theory of Nonparametric Regression , 2002, Springer series in statistics.

[53]  Zhi-Hua Zhou Multi-Instance Learning : A Survey , 2004 .

[54]  Aravind Srinivasan,et al.  Approximating Hyper-Rectangles: Learning and Pseudorandom Sets , 1998, J. Comput. Syst. Sci..

[55]  Slobodan Vucetic,et al.  Mixture Model for Multiple Instance Regression and Applications in Remote Sensing , 2012, IEEE Transactions on Geoscience and Remote Sensing.

[56]  Aihui Zhou,et al.  A spectrum theorem for perturbed bounded linear operators , 2008, Appl. Math. Comput..

[57]  Nenghai Yu,et al.  Multiple-instance ranking: Learning to rank images for image retrieval , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[58]  Hongwei Sun,et al.  Application of integral operator for regularized least-square regression , 2009, Math. Comput. Model..

[59]  James R. Foulds,et al.  A review of multi-instance learning assumptions , 2010, The Knowledge Engineering Review.

[60]  Charles A. Micchelli,et al.  On Learning Vector-Valued Functions , 2005, Neural Computation.

[61]  Fei Wang,et al.  Maximum Margin Multiple Instance Clustering With Applications to Image and Text Clustering , 2011, IEEE Transactions on Neural Networks.

[62]  Barnabás Póczos,et al.  Nonparametric Divergence Estimation with Applications to Machine Learning on Distributions , 2011, UAI.

[63]  S. Mendelson,et al.  Regularization in kernel learning , 2010, 1001.2094.

[64]  Barnabás Póczos,et al.  Two-stage sampled learning theory on distributions , 2015, AISTATS.

[65]  Matthias Hein,et al.  Hilbertian Metrics and Positive Definite Kernels on Probability Measures , 2005, AISTATS.

[66]  Sally A. Goldman,et al.  Multiple-Instance Learning of Real-Valued Data , 2001, J. Mach. Learn. Res..

[67]  Leslie G. Valiant,et al.  A theory of the learnable , 1984, STOC '84.

[68]  E. D. Vito,et al.  DISCRETIZATION ERROR ANALYSIS FOR TIKHONOV REGULARIZATION , 2006 .

[69]  George Pedrick,et al.  Theory of reproducing kernels for Hilbert spaces of vector valued functions , 1957 .

[70]  Martin J. Wainwright,et al.  Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates , 2013, J. Mach. Learn. Res..

[71]  Alexander J. Smola,et al.  Who Supported Obama in 2012?: Ecological Inference through Distribution Regression , 2015, KDD.

[72]  C. Berg,et al.  Harmonic Analysis on Semigroups , 1984 .

[73]  Maurice Bruynooghe,et al.  A polynomial time computable metric between point sets , 2001, Acta Informatica.

[74]  B. Silverman,et al.  Functional Data Analysis , 1997 .

[75]  Kim C. Border,et al.  Infinite Dimensional Analysis: A Hitchhiker’s Guide , 1994 .

[76]  Barnabás Póczos,et al.  Fast Distribution To Real Regression , 2013, AISTATS.

[77]  Philippe Preux,et al.  A Generalized Kernel Approach to Structured Output Learning , 2013, ICML.

[78]  Naftali Tishby,et al.  Multi-instance learning with any hypothesis class , 2011, J. Mach. Learn. Res..

[79]  Tony Jebara,et al.  Probability Product Kernels , 2004, J. Mach. Learn. Res..

[80]  Hans-Georg Müller,et al.  Functional Modelling and Classification of Longitudinal Data. , 2005 .

[81]  Ye Xu,et al.  Non-I.I.D. Multi-Instance Dimensionality Reduction by Learning a Maximum Bag Margin Subspace , 2010, AAAI.

[82]  Zhi-Hua Zhou,et al.  Multi-instance learning by treating instances as non-I.I.D. samples , 2008, ICML '09.

[83]  Philip M. Long,et al.  PAC Learning Axis-aligned Rectangles with Respect to Product Distributions from Multiple-Instance Examples , 1996, COLT '96.

[84]  Alfred O. Hero,et al.  Information-Geometric Dimensionality Reduction , 2011, IEEE Signal Processing Magazine.

[85]  Jiayu Zhou,et al.  Modeling disease progression via multi-task learning , 2013, NeuroImage.

[86]  O. Gaans Probability measures on metric spaces , 2022 .

[87]  Kristin P. Bennett,et al.  Fast Bundle Algorithm for Multiple-Instance Learning , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[88]  Philip M. Long,et al.  PAC Learning Axis-Aligned Rectangles with Respect to Product Distributions from Multiple-Instance Examples , 1996, COLT.

[89]  Dan Zhang,et al.  Multiple Instance Transfer Learning , 2009, 2009 IEEE International Conference on Data Mining Workshops.

[90]  Frank Nielsen,et al.  A closed-form expression for the Sharma–Mittal entropy of exponential families , 2011, ArXiv.

[91]  Ying Chen,et al.  Contextual Hausdorff dissimilarity for multi-instance clustering , 2012, 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery.

[92]  Heikki Mannila,et al.  Distance measures for point sets and their computation , 1997, Acta Informatica.

[93]  Thomas Gärtner,et al.  Multi-Instance Kernels , 2002, ICML.

[94]  Barnabás Póczos,et al.  Distribution to Distribution Regression , 2013, ICML.

[95]  M. Reed Methods of Modern Mathematical Physics. I: Functional Analysis , 1972 .

[96]  Henry W. Altland,et al.  Applied Functional Data Analysis , 2003, Technometrics.

[97]  Kenji Fukumizu,et al.  Semigroup Kernels on Measures , 2005, J. Mach. Learn. Res..

[98]  James T. Kwok,et al.  Marginalized Multi-Instance Kernels , 2007, IJCAI.

[99]  Samory Kpotufe,et al.  k-NN Regression Adapts to Local Intrinsic Dimension , 2011, NIPS.

[100]  C. Carmeli,et al.  VECTOR VALUED REPRODUCING KERNEL HILBERT SPACES OF INTEGRABLE FUNCTIONS AND MERCER THEOREM , 2006 .

[101]  Bernhard Schölkopf,et al.  Towards a Learning Theory of Causation , 2015, 1502.02398.

[102]  Peter Richtárik,et al.  Distributed Coordinate Descent Method for Learning with Big Data , 2013, J. Mach. Learn. Res..

[103]  Andreas Christmann,et al.  Universal Kernels on Non-Standard Input Spaces , 2010, NIPS.

[104]  Kristin P. Bennett,et al.  Multiple instance ranking , 2008, ICML '08.

[105]  Aapo Hyvärinen,et al.  Density Estimation in Infinite Dimensional Exponential Families , 2013, J. Mach. Learn. Res..

[106]  Fanhua Shang,et al.  Maximum margin multiple-instance feature weighting , 2014, Pattern Recognit..

[107]  Barnabás Póczos,et al.  Support Distribution Machines , 2012, ArXiv.

[108]  Barnabás Póczos,et al.  Fast Function to Function Regression , 2014, AISTATS.

[109]  D. L. Cohn Measure Theory: Second Edition , 2013 .

[110]  Boris Babenko Multiple Instance Learning: Algorithms and Applications , 2008 .

[111]  Thomas G. Dietterich,et al.  Solving the Multiple Instance Problem with Axis-Parallel Rectangles , 1997, Artif. Intell..

[112]  David Haussler,et al.  Convolution kernels on discrete structures , 1999 .

[113]  Jaume Amores,et al.  Multiple instance classification: Review, taxonomy and comparative study , 2013, Artif. Intell..

[114]  L. Rosasco,et al.  Less is More: Nyström Computational Regularization , 2015 .

[115]  Bernhard Schölkopf,et al.  A Kernel Two-Sample Test , 2012, J. Mach. Learn. Res..

[116]  Qiang Wu,et al.  A note on application of integral operator in learning theory , 2009 .

[117]  Xu Sun,et al.  Large-Scale Personalized Human Activity Recognition Using Online Multitask Learning , 2013, IEEE Transactions on Knowledge and Data Engineering.

[118]  Stéphane Canu,et al.  Nonlinear functional regression: a functional RKHS approach , 2010, AISTATS.