Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform better in practice in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. Although this is true for low-dimensional, easy problems, we show that for hard problems multiple passes lead to statistically optimal predictions while a single pass does not; we also show that in these hard models the optimal number of passes over the data increases with the sample size. In order to define the notion of hardness and show that our predictive performance is optimal, we consider potentially infinite-dimensional models and notions typically associated with kernel methods, namely the decay of the eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.
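
The sketch below is not the authors' code; it is a minimal illustration, under assumed choices, of the setting the abstract describes: plain SGD on a least-squares objective, cycled over the data for several passes with Polyak averaging of the iterates. The names `multipass_sgd`, `n_passes`, and `step_size`, and the synthetic linear-model data, are illustrative assumptions rather than parameters or experiments from the paper.

```python
# Minimal sketch of multi-pass averaged SGD for least-squares regression.
# Hypothetical helper and parameter names; not the paper's experimental setup.
import numpy as np

def multipass_sgd(X, y, n_passes=5, step_size=0.01):
    """Run SGD on the least-squares objective, cycling over the data
    `n_passes` times, and return the running (Polyak) average of the iterates."""
    n, d = X.shape
    w = np.zeros(d)        # current iterate
    w_avg = np.zeros(d)    # averaged iterate
    t = 0
    for _ in range(n_passes):
        for i in np.random.permutation(n):
            # Stochastic gradient of 0.5 * (x_i^T w - y_i)^2
            grad = (X[i] @ w - y[i]) * X[i]
            w = w - step_size * grad
            t += 1
            w_avg += (w - w_avg) / t
    return w_avg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 1000, 20
    X = rng.standard_normal((n, d))
    w_star = rng.standard_normal(d)
    y = X @ w_star + 0.1 * rng.standard_normal(n)

    X_test = rng.standard_normal((2000, d))
    y_test = X_test @ w_star

    for p in (1, 5, 20):
        w_hat = multipass_sgd(X, y, n_passes=p)
        mse = np.mean((X_test @ w_hat - y_test) ** 2)
        print(f"{p:2d} pass(es): test MSE = {mse:.4f}")
```

In this easy low-dimensional example a single pass is already near-optimal; the paper's point is that on hard problems (slowly decaying covariance eigenvalues, complex optimal predictor in an infinite-dimensional model) the number of passes must grow with the sample size to reach the optimal rate.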
