Learning from Conditional Distributions via Dual Kernel Embeddings

In many machine learning problems, such as policy evaluation in reinforcement learning and learning with invariance, each data point $x$ is itself a conditional distribution $p(z|x)$, and we want to learn a function $f$ that links these conditional distributions to target values $y$. The learning problem becomes very challenging when we have only limited samples, or in the extreme case only a single sample, from each conditional distribution $p(z|x)$. Commonly used approaches either assume that $z$ is independent of $x$, or require an overwhelmingly large number of samples from each conditional distribution. To address these challenges, we propose a novel approach that reformulates the original problem as a min-max optimization problem. In this new view, we only need the kernel embedding of the joint distribution $p(z,x)$, which is easy to estimate. Furthermore, we design an efficient learning algorithm based on mirror-descent stochastic approximation, and establish the sample complexity of learning from conditional distributions. Finally, numerical experiments on both synthetic and real data show that our method significantly improves over the previous state of the art.
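
To make the min-max reformulation concrete, here is a minimal sketch of one plausible reading of the idea for the special case of the squared loss. For a convex loss, Fenchel duality gives $\ell_y(v) = \max_u \,(uv - \ell_y^*(u))$; applying this to $v = \mathbb{E}_{z|x}[f(x,z)]$ pulls the conditional expectation out of the loss, so the objective becomes a saddle-point problem over the joint distribution $p(x,z)$ that admits unbiased stochastic gradients from a single $(x,z,y)$ sample. The toy data generator, random-feature parameterization, step-size schedule, and all names below are illustrative assumptions, not the paper's implementation, and the plain stochastic gradient steps shown are only the Euclidean special case of mirror descent.

```python
import numpy as np

# Toy setup (hypothetical): each x indexes a conditional distribution p(z|x);
# we observe a single z per x, plus a target y. Squared loss:
#   l_y(v) = 0.5 * (v - y)^2,  with convex conjugate  l_y*(u) = 0.5 * u^2 + u*y.
# Saddle objective: min_f max_u  E_{x,z}[u(x) f(x,z)] - E_x[0.5 u(x)^2 + u(x) y(x)].

rng = np.random.default_rng(0)

def rff(x, W, b):
    """Random Fourier features approximating an RBF kernel."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(W @ x + b)

d_x, d_z, D = 1, 1, 128                     # input dims, number of random features
W_f = rng.normal(size=(D, d_x + d_z)); b_f = rng.uniform(0, 2 * np.pi, D)
W_u = rng.normal(size=(D, d_x));       b_u = rng.uniform(0, 2 * np.pi, D)
theta_f = np.zeros(D)                       # primal function f(x, z)
theta_u = np.zeros(D)                       # dual function u(x)

eta = 0.5
for t in range(1, 5001):
    # Draw one sample (x, z, y) from the joint p(x) p(z|x): a single z per x.
    x = rng.uniform(-1, 1, d_x)
    z = np.sin(np.pi * x) + 0.3 * rng.normal(size=d_z)   # toy p(z|x)
    y = np.sin(np.pi * x).sum()                          # toy target

    phi_f = rff(np.concatenate([x, z]), W_f, b_f)
    phi_u = rff(x, W_u, b_u)
    f_val = theta_f @ phi_f
    u_val = theta_u @ phi_u

    # Unbiased single-sample gradients of the saddle objective:
    step = eta / np.sqrt(t)
    theta_u += step * (f_val - u_val - y) * phi_u        # ascent in dual u
    theta_f -= step * u_val * phi_f                      # descent in primal f
```

At the saddle point the dual satisfies $u^*(x) = \mathbb{E}_{z|x}[f(x,z)] - y(x)$, so driving the objective down pushes the conditional mean of $f$ toward the target without ever estimating $p(z|x)$ itself; only joint samples $(x,z)$ are touched, which is the point of the dual embedding view.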
