Exponential convergence of testing error for stochastic gradient methods

We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast under low-noise (margin) conditions.
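To make the setting concrete, below is a minimal sketch of single-pass stochastic gradient descent on the squared loss in a reproducing kernel Hilbert space, with the classifier taken as the sign of the iterate. The Gaussian kernel, the constant step size gamma, and the toy two-blob data generator are illustrative assumptions, not the paper's experimental setup; in particular, the paper's analysis concerns (averaged) stochastic gradient iterates, while this sketch tracks the plain iterate.

```python
import numpy as np

# Minimal sketch: single-pass SGD for least-squares classification in an
# RKHS with a Gaussian kernel. Labels live in {-1, +1} and the classifier
# is sign(f(x)). The iterate is kept as a kernel expansion
#   f_t = sum_i alpha[i] * K(x_i, .)
# All names, constants, and the toy data below are illustrative.

rng = np.random.default_rng(0)

def gaussian_kernel(X, Z, bandwidth=1.0):
    # K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 * bandwidth^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def sample(n):
    # Toy low-noise problem: two well-separated Gaussian blobs.
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * 2.0 + rng.normal(scale=0.5, size=(n, 2))
    return X, y

n_train, n_test = 2000, 500
X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)

alpha = np.zeros(n_train)  # expansion coefficients of the SGD iterate
gamma = 0.5                # constant step size (illustrative choice)

for t in range(n_train):
    # Evaluate the current iterate at the fresh observation x_t.
    k_t = gaussian_kernel(X_train[t:t + 1], X_train[:t])[0]
    f_xt = k_t @ alpha[:t]
    # SGD step on the squared loss (f(x_t) - y_t)^2 / 2:
    #   f_{t+1} = f_t - gamma * (f_t(x_t) - y_t) * K(x_t, .)
    alpha[t] = -gamma * (f_xt - y_train[t])

f_test = gaussian_kernel(X_test, X_train) @ alpha
print("test squared loss:        ", np.mean((f_test - y_test) ** 2))
print("test classification error:", np.mean(np.sign(f_test) != y_test))
```

On such a well-separated problem one typically observes the test classification error reaching zero long before the squared loss has finished decreasing, which is exactly the gap between the two convergence behaviors described above.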
