More Data Can Hurt for Linear Regression: Sample-wise Double Descent

In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime in which the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical work analyzing the "double-descent" phenomenon in linear models. Here, we isolate and explain this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, it arises from an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but the variance increases.
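To make the phenomenon concrete, below is a minimal NumPy simulation sketch of the setting the abstract describes: minimum-norm least squares (the estimator gradient descent converges to from zero initialization) with isotropic Gaussian covariates. The dimension d = 50, noise level sigma = 0.1, signal construction, and trial counts are illustrative assumptions, not values taken from the note.

```python
# Illustrative sketch (not from the note itself): sample-wise double descent
# for minimum-norm linear regression with isotropic Gaussian covariates.
# Assumed parameters: d = 50, sigma = 0.1, unit-norm ground-truth signal.
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials, n_test = 50, 0.1, 200, 2000
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)  # ground-truth signal, unit norm

def test_risk(n):
    """Average test risk of the min-norm interpolant fit on n samples."""
    risks = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))                # isotropic Gaussian covariates
        y = X @ beta + sigma * rng.standard_normal(n)  # noisy labels
        beta_hat = np.linalg.pinv(X) @ y               # min-norm least squares = GD limit
        X_te = rng.standard_normal((n_test, d))
        y_te = X_te @ beta                             # noiseless test targets
        risks.append(np.mean((X_te @ beta_hat - y_te) ** 2))
    return np.mean(risks)

for n in [10, 25, 40, 48, 50, 52, 60, 100]:
    print(f"n = {n:3d}  test risk ~ {test_risk(n):.3f}")
# The printed risk typically *rises* as n grows toward d (here d = 50),
# peaking near the interpolation threshold n = d before falling again:
# in this regime, more data hurts, driven by the growing variance term.
```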
