Paper information - Nonparametric Compositional Stochastic Optimization for Risk-Sensitive Kernel Learning

Nonparametric Compositional Stochastic Optimization for Risk-Sensitive Kernel Learning

In this work, we address optimization problems where the objective function is a nonlinear function of an expected value, i.e., compositional stochastic programs. We consider the case where the decision variable is not vector-valued but instead belongs to a Reproducing Kernel Hilbert Space (RKHS), motivated by risk-aware formulations of supervised learning. We develop the first memory-efficient stochastic algorithm for this setting, which we call Compositional Online Learning with Kernels (COLK). COLK, at its core a two time-scale stochastic approximation method, addresses two facts: (i) compositions of expected value problems cannot be addressed by the stochastic gradient method due to the presence of an inner expectation; and (ii) the RKHS-induced parameterization has complexity proportional to the iteration index, which is mitigated through greedily constructed subspace projections. We provide, for the first time, a non-asymptotic tradeoff between the complexity of a function parameterization and its required convergence accuracy for both strongly convex and non-convex objectives under constant step-sizes. Experiments with risk-sensitive supervised learning demonstrate that COLK consistently converges and performs reliably even when data is full of outliers, and thus marks a step towards ameliorating overfitting. Specifically, we observe a favorable tradeoff between model complexity, consistent convergence, and statistical accuracy for data associated with heavy-tailed distributions.
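The two ingredients named in the abstract, a two time-scale update and a bounded kernel dictionary, can be illustrated with a small sketch. This is not the authors' COLK implementation; everything below is an assumption made for illustration: scalar inputs, a Gaussian kernel, a mean-variance objective E[loss] + mu * Var[loss] with squared loss, and a simple coherence test (fold the update into the nearest existing center) standing in for the greedy subspace projection, whose details the abstract does not give. All function and parameter names are hypothetical.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=0.3):
    # Gaussian (RBF) kernel on scalars; the bandwidth is an illustrative choice.
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def predict(centers, weights, x):
    # Kernel expansion f(x) = sum_i w_i * k(c_i, x).
    return sum(w * gaussian_kernel(c, x) for c, w in zip(centers, weights))

def colk_style_fit(samples, eta=0.1, beta=0.5, mu=0.1, lam=1e-3, coherence=0.9):
    """Two time-scale compositional kernel SGD sketch (assumed objective:
    E[loss] + mu * Var[loss] + (lam / 2) * ||f||^2, squared loss)."""
    centers, weights = [], []
    g = 0.0  # auxiliary variable tracking the inner expectation E[loss]
    for x, y in samples:
        resid = predict(centers, weights, x) - y
        loss = resid ** 2
        # Faster time-scale: running average of the inner expectation.
        g = (1.0 - beta) * g + beta * loss
        # Slower time-scale: stochastic quasi-gradient of the composition,
        # using g in place of the unknown E[loss].
        coef = 2.0 * resid * (1.0 + 2.0 * mu * (loss - g))
        # Shrinkage from the RKHS-norm regularizer.
        weights = [(1.0 - eta * lam) * w for w in weights]
        # Stand-in for the subspace projection: reuse the most similar
        # center if it is coherent enough, otherwise add a new center.
        j = None
        if centers:
            j = max(range(len(centers)),
                    key=lambda i: gaussian_kernel(centers[i], x))
        if j is not None and gaussian_kernel(centers[j], x) >= coherence:
            weights[j] -= eta * coef
        else:
            centers.append(x)
            weights.append(-eta * coef)
    return centers, weights, g

def make_data(n, rng):
    # Synthetic regression data: y = sin(3x) + Gaussian noise, x in [-1, 1].
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, math.sin(3.0 * x) + rng.gauss(0.0, 0.1)))
    return data

def mean_squared_error(centers, weights, data):
    return sum((predict(centers, weights, x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(2000, rng)
centers, weights, g = colk_style_fit(train)
print("dictionary size:", len(centers), "tracked mean loss:", g)
```

Without the coherence test, every sample would add a kernel center, so the model would grow linearly with the iteration index. With it, centers on [-1, 1] must be pairwise farther apart than the distance at which the kernel drops below the threshold, which caps the dictionary size regardless of how many samples are processed.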
