Optimal Learning for Multi-pass Stochastic Gradient Methods

We analyze the learning properties of the stochastic gradient method when multiple passes over the data and mini-batches are allowed. In particular, we consider the square loss and show that, for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite-sample bounds can be achieved by early stopping. Moreover, we show that larger step-sizes are allowed when considering mini-batches. Our analysis is based on a unifying approach, encompassing both batch and stochastic gradient methods as special cases.
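To make the setting concrete, here is a minimal numerical sketch of multi-pass mini-batch SGD on the square loss, where the number of passes plays the role of the regularization parameter. This is not the paper's algorithm or analysis: the function name, the validation-based stopping rule, and the default values of `step_size`, `batch_size`, and `patience` are illustrative assumptions standing in for the theoretically prescribed step-size and stopping time.

```python
import numpy as np

def multipass_minibatch_sgd(X, y, X_val, y_val,
                            step_size=0.1, batch_size=8,
                            max_passes=100, patience=5, seed=0):
    """Mini-batch SGD for least squares, using the number of passes as the
    regularization parameter: monitor held-out error after each pass and keep
    the iterate with the smallest validation error (early stopping).

    NOTE: the validation-based stopping rule and the default constants are
    illustrative assumptions, not the step-size/stopping-time choices
    analyzed in the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    best_w, best_err, stall = w.copy(), np.inf, 0

    for _ in range(max_passes):
        # One pass over the data in shuffled mini-batches.
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)  # gradient of 0.5 * mean squared residual
            w -= step_size * grad

        # Early stopping: the pass count acts as the regularization parameter.
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_err:
            best_w, best_err, stall = w.copy(), val_err, 0
        else:
            stall += 1
            if stall >= patience:
                break
    return best_w, best_err

# Toy usage: noisy linear data, split into training and validation parts.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 20))
y = X @ rng.standard_normal(20) + 0.5 * rng.standard_normal(300)
w_hat, err = multipass_minibatch_sgd(X[:240], y[:240], X[240:], y[240:])
print(f"validation MSE at the early-stopped pass: {err:.3f}")
```

The held-out validation split is only a practical stand-in for choosing the stopping time; the paper's contribution is to show that, with a universal step-size, stopping after an appropriate number of passes already yields optimal finite-sample bounds, and that mini-batching permits larger step-sizes.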
