The Benefits of Implicit Regularization from SGD in Least Squares Problems

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than those provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to achieve the same generalization performance. Taken together, our results show that, up to logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, can be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.
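The comparison described above can be illustrated with a small simulation. The sketch below is a minimal illustration, not the paper's experimental setup: it contrasts one pass of constant-stepsize, iterate-averaged SGD with closed-form ridge regression on a synthetic overparameterized least-squares instance. The power-law covariance spectrum, stepsize, ridge parameter, dimension, sample sizes, and the logarithmic oversampling factor are all assumed choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic overparameterized least-squares instance (illustrative parameters).
d, n_ridge = 500, 200                       # dimension exceeds the ridge sample size
eigs = 1.0 / np.arange(1, d + 1) ** 2       # assumed power-law covariance spectrum
w_star = rng.normal(size=d) / np.sqrt(d)    # ground-truth parameter
noise_std = 0.1

def sample(n):
    # Gaussian features with diagonal covariance diag(eigs).
    X = rng.normal(size=(n, d)) * np.sqrt(eigs)
    y = X @ w_star + noise_std * rng.normal(size=n)
    return X, y

def excess_risk(w):
    # Population excess risk E[(x^T (w - w_star))^2] under the diagonal covariance.
    diff = w - w_star
    return float(np.sum(eigs * diff ** 2))

def ridge(X, y, lam):
    # Closed-form ridge regression estimator.
    n = X.shape[0]
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

def averaged_sgd(X, y, stepsize):
    # One pass of constant-stepsize SGD with running iterate averaging.
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(X.shape[0]):
        x_t, y_t = X[t], y[t]
        w -= stepsize * (x_t @ w - y_t) * x_t
        w_avg += (w - w_avg) / (t + 1)
    return w_avg

# Ridge sees n_ridge samples; SGD sees a logarithmic factor more (assumed factor).
X_r, y_r = sample(n_ridge)
n_sgd = int(n_ridge * np.log(n_ridge))
X_s, y_s = sample(n_sgd)

print("ridge  excess risk:", excess_risk(ridge(X_r, y_r, lam=1e-2)))
print("avgSGD excess risk:", excess_risk(averaged_sgd(X_s, y_s, stepsize=0.25)))
```

The stepsize is kept below roughly 1 / E[||x||^2] so the single-pass recursion stays stable; sweeping the stepsize and the ridge parameter over a grid would give a tuned version of the comparison made in the abstract.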
