Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model

In the context of statistical supervised learning, the noiseless linear model assumes that there exists a deterministic linear relation $Y = \langle \theta_*, \Phi(U) \rangle$ between the random output $Y$ and the random feature vector $\Phi(U)$, a potentially non-linear transformation of the inputs $U$. We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-squares risk under this model. The convergence of the iterates to the optimum $\theta_*$ and the decay of the generalization error follow polynomial rates whose exponents both depend on the regularity of the optimum $\theta_*$ and of the feature vectors $\Phi(u)$. We interpret our result in the reproducing kernel Hilbert space framework. As a special case, we analyze an online algorithm for estimating a real function on the unit interval from noiseless observations of its values at randomly sampled points; the convergence rate depends on the Sobolev smoothness of the function and of the chosen kernel. Finally, we apply our analysis beyond the supervised learning setting to obtain convergence rates for the averaging process (a.k.a. gossip algorithm) on a graph, depending on its spectral dimension.
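To make the setting concrete, here is a minimal sketch (not the authors' implementation) of the special case described above: single-pass, constant step-size SGD on the least-squares risk, run in a reproducing kernel Hilbert space, for estimating a real function on the unit interval from noiseless observations at uniformly sampled points. The Laplacian kernel, the sinusoidal target, the step size, and the sample size are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch, assuming a Laplacian kernel on [0, 1] and a sine target:
# single-pass, constant step-size SGD for noiseless nonparametric regression.
import numpy as np

rng = np.random.default_rng(0)

def kernel(u, v):
    # Laplacian kernel (Sobolev-type RKHS on the unit interval); assumed choice.
    return np.exp(-np.abs(u - v))

def f_star(u):
    # Assumed target function, observed without noise.
    return np.sin(2 * np.pi * u)

n_samples = 2000
gamma = 0.5  # constant step size (assumed value)

centers = np.empty(n_samples)  # sampled design points u_1, ..., u_n
coeffs = np.empty(n_samples)   # kernel-expansion coefficients of the SGD iterate

for n in range(n_samples):
    u = rng.uniform(0.0, 1.0)   # random design point
    y = f_star(u)               # noiseless observation of the target
    # Current prediction f_{n-1}(u) = sum_i coeffs[i] * K(centers[i], u)
    pred = coeffs[:n] @ kernel(centers[:n], u) if n > 0 else 0.0
    # One SGD step on the least-squares risk adds a kernel atom centered at u
    centers[n] = u
    coeffs[n] = -gamma * (pred - y)

# Evaluate the final iterate on a grid and report an empirical L2-type error.
grid = np.linspace(0.0, 1.0, 512)
f_hat = (coeffs[:, None] * kernel(centers[:, None], grid[None, :])).sum(axis=0)
print("RMSE on grid:", np.sqrt(np.mean((f_hat - f_star(grid)) ** 2)))
```

Each sample is visited exactly once and the step size is never decayed; under the noiseless model this is the regime whose polynomial rates, governed by the smoothness of the target and of the kernel, the paper quantifies.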
