Projected Stochastic Primal-Dual Method for Constrained Online Learning With Kernels

We consider the problem of stochastic optimization with nonlinear constraints, where the decision variable is not vector-valued but instead a function belonging to a reproducing kernel Hilbert space (RKHS). Currently, solutions exist only for special cases of this problem. To solve this constrained problem with kernels, we first generalize the Representer Theorem to a class of saddle-point problems defined over an RKHS. Then, we develop a primal-dual method that executes alternating projected primal/dual stochastic gradient descent/ascent on the dual-augmented Lagrangian of the problem. The primal projection sets are low-dimensional subspaces of the ambient function space, which are greedily constructed using matching pursuit. By tuning the projection-induced error to the algorithm step-size, we are able to establish mean convergence in both primal objective sub-optimality and constraint violation, to respective <inline-formula><tex-math notation="LaTeX">${\mathcal O}(\sqrt{T})$</tex-math></inline-formula> and <inline-formula><tex-math notation="LaTeX">${\mathcal O}(T^{3/4})$</tex-math></inline-formula> neighborhoods. Here, <inline-formula><tex-math notation="LaTeX">$T$</tex-math></inline-formula> is the final iteration index and the constant step-size is chosen as <inline-formula><tex-math notation="LaTeX">$1/\sqrt{T}$</tex-math></inline-formula> with a <inline-formula><tex-math notation="LaTeX">$1/T$</tex-math></inline-formula> approximation budget. Finally, we demonstrate experimentally the effectiveness of the proposed method for risk-aware supervised learning.
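
The alternating updates described above can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: a Gaussian kernel, a squared loss, a single hypothetical constraint requiring the expected value of f(x) to stay below a threshold c, a regularizer weight lam, a dual damping weight delta, and a simplified greedy pruning routine standing in for the matching-pursuit projection. The names gauss_kernel, prune, and primal_dual_kernel are likewise hypothetical.

```python
import numpy as np

def gauss_kernel(A, B, bandwidth=0.5):
    """Gaussian kernel matrix between rows of A (n, d) and rows of B (m, d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def prune(D, w, eps, jitter=1e-8):
    """Greedily drop dictionary points while the RKHS-norm error between the
    unpruned function and its projection onto the kept points is <= eps.
    (A simplified stand-in for the matching-pursuit projection.)"""
    K_full = gauss_kernel(D, D)
    norm2 = w @ K_full @ w                    # squared RKHS norm of the unpruned f
    keep = list(range(len(D)))
    w_keep = w.copy()
    while len(keep) > 1:
        best = None
        for j in keep:
            S = [i for i in keep if i != j]   # candidate dictionary without j
            K_SS = K_full[np.ix_(S, S)] + jitter * np.eye(len(S))
            b = K_full[S, :] @ w              # inner products <k(d_i, .), f>
            w_S = np.linalg.solve(K_SS, b)    # weights of the projection
            err2 = max(norm2 - b @ w_S, 0.0)  # squared projection error
            if best is None or err2 < best[0]:
                best = (err2, S, w_S)
        if np.sqrt(best[0]) > eps:            # budget exceeded: stop pruning
            break
        _, keep, w_keep = best
    return D[keep], w_keep

def primal_dual_kernel(X, y, c, eta, eps, lam=1e-3, delta=1e-3):
    """One pass of stochastic primal descent / dual ascent on the assumed
    Lagrangian  0.5*(f(x)-y)^2 + lam/2*||f||^2 + mu*(f(x)-c) - delta*eta/2*mu^2."""
    D = np.empty((0, X.shape[1]))             # dictionary of kernel centers
    w = np.empty(0)                           # kernel weights
    mu = 0.0                                  # dual variable (kept nonnegative)
    for x_t, y_t in zip(X, y):
        f_x = float(w @ gauss_kernel(D, x_t[None, :])[:, 0])
        # Coefficient of k(x_t, .) in the stochastic functional gradient.
        grad_coeff = (f_x - y_t) + mu
        # Primal descent: shrink old weights, append the new sample.
        w = np.append((1.0 - eta * lam) * w, -eta * grad_coeff)
        D = np.vstack([D, x_t[None, :]])
        # Dual ascent, projected onto the nonnegative reals.
        mu = max(0.0, mu + eta * ((f_x - c) - delta * eta * mu))
        # Projection onto a low-dimensional subspace of the function space.
        D, w = prune(D, w, eps)
    return D, w, mu

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 60
    X = rng.uniform(-1.0, 1.0, size=(T, 1))
    y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(T)
    D, w, mu = primal_dual_kernel(X, y, c=0.5, eta=1 / np.sqrt(T), eps=1.0 / T)
    print(len(D), mu)
```

The usage at the bottom follows the scaling stated above, with step-size 1/sqrt(T) and approximation budget 1/T. The pruning loop re-solves a small linear system for every candidate removal, which keeps the sketch short but would be too slow for large dictionaries.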
