Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model

In the context of statistical supervised learning, the noiseless linear model assumes that there exists a deterministic linear relation $Y = \langle \theta_*, \Phi(U) \rangle$ between the random output $Y$ and the random feature vector $\Phi(U)$, a potentially non-linear transformation of the inputs $U$. We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-squares risk under this model. The convergence of the iterates to the optimum $\theta_*$ and the decay of the generalization error follow polynomial rates whose exponents both depend on the regularity of the optimum $\theta_*$ and of the feature vectors $\Phi(u)$. We interpret our result in the reproducing kernel Hilbert space framework. As a special case, we analyze an online algorithm for estimating a real function on the unit interval from noiseless observations of its values at randomly sampled points; the convergence rate depends on the Sobolev smoothness of the function and of the chosen kernel. Finally, we apply our analysis beyond the supervised learning setting to obtain convergence rates for the averaging process (a.k.a. gossip algorithm) on a graph, depending on its spectral dimension.
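To make the setting concrete, here is a minimal sketch (not the authors' implementation) of the special case described above: single-pass, constant step-size SGD on the least-squares risk, run in a reproducing kernel Hilbert space, for estimating a real function on the unit interval from noiseless observations at uniformly sampled points. The Laplacian kernel, the sinusoidal target, the step size, and the sample size are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch, assuming a Laplacian kernel on [0, 1] and a sine target:
# single-pass, constant step-size SGD for noiseless nonparametric regression.
import numpy as np

rng = np.random.default_rng(0)

def kernel(u, v):
    # Laplacian kernel (Sobolev-type RKHS on the unit interval); assumed choice.
    return np.exp(-np.abs(u - v))

def f_star(u):
    # Assumed target function, observed without noise.
    return np.sin(2 * np.pi * u)

n_samples = 2000
gamma = 0.5  # constant step size (assumed value)

centers = np.empty(n_samples)  # sampled design points u_1, ..., u_n
coeffs = np.empty(n_samples)   # kernel-expansion coefficients of the SGD iterate

for n in range(n_samples):
    u = rng.uniform(0.0, 1.0)   # random design point
    y = f_star(u)               # noiseless observation of the target
    # Current prediction f_{n-1}(u) = sum_i coeffs[i] * K(centers[i], u)
    pred = coeffs[:n] @ kernel(centers[:n], u) if n > 0 else 0.0
    # One SGD step on the least-squares risk adds a kernel atom centered at u
    centers[n] = u
    coeffs[n] = -gamma * (pred - y)

# Evaluate the final iterate on a grid and report an empirical L2-type error.
grid = np.linspace(0.0, 1.0, 512)
f_hat = (coeffs[:, None] * kernel(centers[:, None], grid[None, :])).sum(axis=0)
print("RMSE on grid:", np.sqrt(np.mean((f_hat - f_star(grid)) ** 2)))
```

Each sample is visited exactly once and the step size is never decayed; under the noiseless model this is the regime whose polynomial rates, governed by the smoothness of the target and of the kernel, the paper quantifies.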
