Nonparametric Compositional Stochastic Optimization for Risk-Sensitive Kernel Learning

In this work, we address optimization problems in which the objective is a nonlinear function of an expected value, i.e., compositional stochastic programs. We consider the case where the decision variable is not vector-valued but instead belongs to a reproducing kernel Hilbert space (RKHS), motivated by risk-aware formulations of supervised learning. We develop the first memory-efficient stochastic algorithm for this setting, which we call Compositional Online Learning with Kernels (COLK). COLK, at its core a two time-scale stochastic approximation method, addresses two issues: (i) compositions of expected-value problems cannot be solved by stochastic gradient methods because of the inner expectation; and (ii) the RKHS-induced parameterization has complexity proportional to the iteration index, which we mitigate through greedily constructed subspace projections. We provide, for the first time, a non-asymptotic tradeoff between the complexity of a function parameterization and its required convergence accuracy for both strongly convex and non-convex objectives under constant step-sizes. Experiments with risk-sensitive supervised learning demonstrate that COLK converges consistently and performs reliably even when the data is full of outliers, and thus marks a step towards mitigating overfitting. Specifically, we observe a favorable tradeoff between model complexity, consistent convergence, and statistical accuracy for data drawn from heavy-tailed distributions.
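To make the two time-scale idea concrete, the following is a minimal sketch of a stochastic compositional gradient iteration on a scalar toy problem. It is not COLK itself: the decision variable is a real number rather than an RKHS element, and there is no subspace projection. The objective F(x) = f(E[g(x; xi)]) with f(u) = 0.5 u^2 and g(x; xi) = a_xi x - b_xi, the names `scgd` and `noisy_sample`, and the step sizes `alpha` and `beta` are all assumptions chosen for illustration.

```python
import random


def scgd(sample, alpha=0.01, beta=0.1, steps=20000, seed=0):
    """Two time-scale stochastic compositional gradient sketch (toy, scalar).

    Minimizes F(x) = f(E[g(x; xi)]) with the illustrative choices
    f(u) = 0.5 * u**2 and g(x; xi) = a_xi * x - b_xi.
    """
    rng = random.Random(seed)
    x = 0.0  # decision variable (a scalar here, not an RKHS element)
    u = 0.0  # auxiliary variable tracking the inner expectation E[g(x; xi)]
    for _ in range(steps):
        a_val, b_val = sample(rng)   # sample used for the inner function value
        a_grad, _ = sample(rng)      # independent sample for the inner gradient
        # Faster time scale (beta > alpha): running average of g(x; xi).
        u = (1.0 - beta) * u + beta * (a_val * x - b_val)
        # Slower time scale: chain rule, with u standing in for E[g(x; xi)].
        # Here dg/dx = a_grad and f'(u) = u.
        x = x - alpha * a_grad * u
    return x


def noisy_sample(rng):
    """Hypothetical data model: E[a] = 2 and E[b] = 2, so F is minimized at x = 1."""
    return 2.0 + rng.gauss(0.0, 0.1), 2.0 + rng.gauss(0.0, 0.5)


x_hat = scgd(noisy_sample)
print(abs(x_hat - 1.0))  # distance to the minimizer of F
```

The auxiliary variable `u` is what separates this from a plain stochastic gradient step: a single sample of g cannot be pushed through the nonlinear f without bias, so the inner expectation is averaged on the faster time scale while `x` moves on the slower one. With constant step sizes, as in this sketch, the iterate is expected to settle in a neighborhood of the minimizer rather than converge exactly.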
