Optimal Learning for Multi-pass Stochastic Gradient Methods

We analyze the learning properties of the stochastic gradient method when multiple passes over the data and mini-batches are allowed. In particular, we consider the square loss and show that, for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite-sample bounds can be achieved by early stopping. Moreover, we show that larger step-sizes are allowed when considering mini-batches. Our analysis is based on a unifying approach, encompassing both batch and stochastic gradient methods as special cases.
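To make the setting concrete, here is a minimal numerical sketch of multi-pass mini-batch SGD on the square loss, where the number of passes plays the role of the regularization parameter. This is not the paper's algorithm or analysis: the function name, the validation-based stopping rule, and the default values of `step_size`, `batch_size`, and `patience` are illustrative assumptions standing in for the theoretically prescribed step-size and stopping time.

```python
import numpy as np

def multipass_minibatch_sgd(X, y, X_val, y_val,
                            step_size=0.1, batch_size=8,
                            max_passes=100, patience=5, seed=0):
    """Mini-batch SGD for least squares, using the number of passes as the
    regularization parameter: monitor held-out error after each pass and keep
    the iterate with the smallest validation error (early stopping).

    NOTE: the validation-based stopping rule and the default constants are
    illustrative assumptions, not the step-size/stopping-time choices
    analyzed in the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    best_w, best_err, stall = w.copy(), np.inf, 0

    for _ in range(max_passes):
        # One pass over the data in shuffled mini-batches.
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)  # gradient of 0.5 * mean squared residual
            w -= step_size * grad

        # Early stopping: the pass count acts as the regularization parameter.
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_err:
            best_w, best_err, stall = w.copy(), val_err, 0
        else:
            stall += 1
            if stall >= patience:
                break
    return best_w, best_err

# Toy usage: noisy linear data, split into training and validation parts.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 20))
y = X @ rng.standard_normal(20) + 0.5 * rng.standard_normal(300)
w_hat, err = multipass_minibatch_sgd(X[:240], y[:240], X[240:], y[240:])
print(f"validation MSE at the early-stopped pass: {err:.3f}")
```

The held-out validation split is only a practical stand-in for choosing the stopping time; the paper's contribution is to show that, with a universal step-size, stopping after an appropriate number of passes already yields optimal finite-sample bounds, and that mini-batching permits larger step-sizes.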
