Paper information - Nonparametric Compositional Stochastic Optimization for Risk-Sensitive Kernel Learning

In this work, we address optimization problems where the objective function is a nonlinear function of an expected value, i.e., compositional stochastic programs. We consider the case where the decision variable is not vector-valued but instead belongs to a Reproducing Kernel Hilbert Space (RKHS), motivated by risk-aware formulations of supervised learning. We develop the first memory-efficient stochastic algorithm for this setting, which we call Compositional Online Learning with Kernels (COLK). COLK, at its core a two time-scale stochastic approximation method, addresses two facts: (i) compositions of expected value problems cannot be addressed by the stochastic gradient method due to the presence of an inner expectation; and (ii) the RKHS-induced parameterization has complexity proportional to the iteration index, which is mitigated through greedily constructed subspace projections. We provide, for the first time, a non-asymptotic tradeoff between the complexity of a function parameterization and its required convergence accuracy for both strongly convex and non-convex objectives under constant step-sizes. Experiments with risk-sensitive supervised learning demonstrate that COLK consistently converges and performs reliably even when data is full of outliers, and thus marks a step towards overcoming overfitting. Specifically, we observe a favorable tradeoff between model complexity, consistent convergence, and statistical accuracy for data associated with heavy-tailed distributions.
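To make the two ingredients named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical mean-variance objective E[loss] + lam * Var[loss] with squared loss, whose gradient depends on the inner expectation E[loss]. That expectation is tracked by an auxiliary running average `g` (the faster time-scale), while the kernel expansion is updated with a smaller step-size. The pruning routine `compress` is a simplified stand-in for the greedily constructed subspace projections. All function names, step-sizes, the Gaussian kernel bandwidth, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def gauss_kernel(A, B, bandwidth=0.5):
    """Gaussian kernel matrix between 1-D point sets A and B."""
    diff = A[:, None] - B[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def compress(D, w, eps):
    """Greedy destructive pruning (illustrative stand-in): drop dictionary
    points while the RKHS-norm error of the re-projected function is <= eps."""
    K = gauss_kernel(D, D)
    target_sq = w @ K @ w                  # squared RKHS norm of the input function
    keep, v = list(range(len(D))), w.copy()
    while len(keep) > 1:
        best = None
        for j in keep:
            trial = [i for i in keep if i != j]
            Kt = K[np.ix_(trial, trial)]
            b = K[trial] @ w               # inner products <k(d_i, .), f>
            # least-squares weights on the reduced dictionary (jitter for stability)
            v_t = np.linalg.solve(Kt + 1e-8 * np.eye(len(trial)), b)
            err = np.sqrt(max(target_sq - 2 * v_t @ b + v_t @ Kt @ v_t, 0.0))
            if best is None or err < best[0]:
                best = (err, trial, v_t)
        if best[0] > eps:                  # cheapest removal is too costly: stop
            break
        _, keep, v = best
    return D[keep], v

def colk_sketch(X, Y, alpha=0.2, beta=0.5, lam=0.1, ridge=1e-3, eps=0.02):
    """Two time-scale compositional functional SGD with dictionary pruning."""
    D, w = np.empty(0), np.empty(0)        # dictionary points and kernel weights
    g = 0.0                                # running estimate of inner expectation E[loss]
    for x, y in zip(X, Y):
        fx = gauss_kernel(np.array([x]), D)[0] @ w if len(D) else 0.0
        resid = fx - y
        loss = 0.5 * resid**2
        g = (1 - beta) * g + beta * loss   # fast time-scale: track E[loss]
        # chain rule for E[loss] + lam * (E[loss^2] - E[loss]^2), with g replacing E[loss]
        coeff = (1.0 + 2.0 * lam * (loss - g)) * resid
        # slow time-scale: functional gradient step adds one kernel centred at x
        w = np.append((1 - alpha * ridge) * w, -alpha * coeff)
        D = np.append(D, x)
        D, w = compress(D, w, eps)         # keep the model order bounded
    return D, w, g

# Usage on synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, 300)
Y = np.sin(X) + 0.1 * rng.standard_normal(300)
D, w, g = colk_sketch(X, Y)
print("model order:", len(D), "of", len(X), "samples")
```

Without `compress`, the dictionary would gain one point per sample, which is the growth with the iteration index that item (ii) refers to. The tolerance `eps` trades model order against approximation error in this sketch.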
