Fast and strong convergence of online learning algorithms

In this paper, we study online learning algorithms without explicit regularization terms. Such an algorithm is essentially a stochastic gradient descent scheme in a reproducing kernel Hilbert space (RKHS). The polynomially decaying step size used in each iteration plays the role of regularization and ensures the generalization ability of the algorithm. We develop a novel capacity-dependent analysis of the performance of the last iterate of the online learning algorithm, which answers an open problem in learning theory. The contribution of this paper is twofold. First, our capacity-dependent analysis leads to sharp convergence rates in the standard mean square distance, improving on results in the literature. Second, we establish, for the first time, strong convergence of the last iterate in the RKHS norm with polynomially decaying step sizes. The analysis fully exploits the fine structure of the underlying RKHS and thus yields sharp error estimates for the online learning algorithm.
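For concreteness, the following is a minimal sketch of the unregularized online iteration in the least-squares setting the abstract describes; the initialization f_1 = 0 and the specific step-size form are illustrative assumptions, not quoted from the paper:

$$
f_{t+1} \;=\; f_t \;-\; \eta_t\,\bigl(f_t(x_t)-y_t\bigr)\,K_{x_t},
\qquad f_1 = 0,
\qquad \eta_t = \eta_1\, t^{-\theta},\ \ \theta \in (0,1),
$$

where $(x_t,y_t)$ is the sample received at step $t$, $K$ is the reproducing kernel of the RKHS, and $K_{x_t} = K(x_t,\cdot)$. No explicit regularization term appears in the update; the decay of $\eta_t$ is what controls overfitting. The last iterate $f_{T+1}$ is the object whose convergence, both in the mean square distance and in the RKHS norm, the paper analyzes.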
