Generalization Error without Independence: Denoising, Linear Regression, and Transfer Learning

Studying the generalization abilities of linear models with real data is a central question in statistical learning. While a few important prior works (Loureiro et al., 2021a, 2021b; Wei et al., 2022) do validate theoretical analyses against real data, they rest on technical assumptions, such as a well-conditioned covariance matrix and independent and identically distributed samples, that do not necessarily hold for real data. Similarly, prior works that address distribution shift typically impose technical assumptions on the joint distribution of the training and test data (Tripuraneni et al., 2021; Wu and Xu, 2020) and are not tested on real data. To better model real data, we instead consider data that is not i.i.d. but has a low-rank structure, and we handle distribution shift by decoupling the assumptions on the training and test distributions. We derive asymptotically exact analytical formulas for the generalization error of the denoising problem and use them to obtain theoretical results for linear regression, data augmentation, principal component regression, and transfer learning. We validate all of our theoretical results on real data, observing a relative mean squared error of around 1% between the empirical risk and our estimated risk.
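
The sketch below illustrates the kind of validation described above: low-rank "clean" data observed with additive noise, a ridge-regularized linear denoiser fit on the noisy observations, and a comparison of the empirical test risk against an estimate of the population risk via the relative-error metric. It is a minimal illustration, not the paper's construction: the rank r, noise level sigma, ridge parameter lam, and the Monte Carlo stand-in for the analytical risk formula are all assumptions made here for concreteness.

```python
# Illustrative sketch: empirical vs. estimated denoising risk for a ridge-regularized
# linear denoiser trained on low-rank data plus Gaussian noise. All problem sizes and
# the Monte Carlo "estimated risk" are hypothetical placeholders, not the paper's formulas.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, r = 100, 500, 2000, 5   # ambient dimension, sample sizes, data rank
sigma, lam = 0.5, 1e-1                      # noise level, ridge parameter

def sample_low_rank(n, basis, rng):
    """Draw n clean samples lying in the span of a fixed rank-r orthonormal basis."""
    coeffs = rng.standard_normal((basis.shape[1], n))
    return basis @ coeffs                   # shape (d, n)

# Fixed rank-r subspace for the clean signals (columns are orthonormal).
basis = np.linalg.qr(rng.standard_normal((d, r)))[0]

# Training pairs: clean signals X and their noisy observations Y.
X = sample_low_rank(n_train, basis, rng)
Y = X + sigma * rng.standard_normal(X.shape)

# Ridge-regularized linear denoiser: minimizes ||W Y - X||_F^2 + n_train*lam*||W||_F^2,
# whose closed form is W = X Y^T (Y Y^T + n_train*lam*I)^{-1}.
W = X @ Y.T @ np.linalg.inv(Y @ Y.T + n_train * lam * np.eye(d))

def risk(W, n, basis, sigma, rng):
    """Empirical denoising risk E||W y - x||^2 / d on n fresh samples."""
    X_new = sample_low_rank(n, basis, rng)
    Y_new = X_new + sigma * rng.standard_normal(X_new.shape)
    return np.mean(np.sum((W @ Y_new - X_new) ** 2, axis=0)) / d

# Empirical test risk versus an independent, larger Monte Carlo estimate of the
# population risk (used here as a stand-in for an analytical risk formula), compared
# through the relative mean squared error described in the abstract.
emp_risk = risk(W, n_test, basis, sigma, rng)
est_risk = risk(W, 20 * n_test, basis, sigma, rng)
rel_mse = (emp_risk - est_risk) ** 2 / est_risk ** 2
print(f"empirical risk {emp_risk:.4f}, estimated risk {est_risk:.4f}, relative MSE {rel_mse:.2e}")
```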

[1] Bochao Gu, et al. Under-Parameterized Double Descent for Ridge Regularized Least Squares Denoising of Data on a Line, 2023, arXiv.

[2] Reinhard Heckel, et al. Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization, 2022, arXiv.

[3] Jeffrey Pennington, et al. Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions, 2022, NeurIPS.

[4] Reinhard Heckel, et al. Regularization-wise double descent: Why it occurs and how to eliminate it, 2022, IEEE International Symposium on Information Theory (ISIT).

[5] Jeffrey Pennington, et al. Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties, 2022, arXiv:2205.07069.

[6] J. Steinhardt, et al. More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize, 2022, ICML.

[7] O. Shamir, et al. The Implicit Bias of Benign Overfitting, 2022, COLT.

[8] Guido Montufar, et al. Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks, 2022, ICLR.

[9] Jeffrey Pennington, et al. Covariate Shift in High-Dimensional Random Feature Regression, 2021, arXiv.

[10] Yair Carmon, et al. Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization, 2021, ICML.

[11] Qi Lei, et al. Near-Optimal Linear Regression under Distribution Shift, 2021, ICML.

[12] F. Krzakala, et al. Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions, 2021, NeurIPS.

[13] Florent Krzakala, et al. Learning curves of generic features maps for realistic datasets with a teacher-student model, 2021, NeurIPS.

[14] Mohammad Zalbagi Darestani, et al. Measuring Robustness in Deep Learning Based Compressive Sensing, 2021, ICML.

[15] Andrea Montanari, et al. Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration, 2021, Applied and Computational Harmonic Analysis.

[16] D. Hogg, et al. Dimensionality reduction, regularization, and generalization in overparameterized regressions, 2020, SIAM Journal on Mathematics of Data Science.

[17] Andrea Montanari, et al. When do neural networks outperform kernel methods?, 2020, NeurIPS.

[18] Ji Xu, et al. On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression, 2020, NeurIPS.

[19] Levent Sagun, et al. Triple descent and the two kinds of overfitting: where and why do they appear?, 2020, NeurIPS.

[20] Benjamin Recht, et al. The Effect of Natural Distribution Shift on Question Answering Models, 2020, ICML.

[21] Edgar Dobriban, et al. The Implicit Regularization of Stochastic Gradient Flow for Least Squares, 2020, ICML.

[22] Tengyu Ma, et al. Optimal Regularization Can Mitigate Double Descent, 2020, ICLR.

[23] Florent Krzakala, et al. Generalisation error in learning with random features and the hidden manifold model, 2020, ICML.

[24] Arthur Jacot, et al. Implicit Regularization of Random Feature Models, 2020, ICML.

[25] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.

[26] Francis Bach, et al. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss, 2020, COLT.

[27] Jeffrey Pennington, et al. Nonlinear random matrix theory for deep learning, 2019, Journal of Statistical Mechanics: Theory and Experiment.

[28] Michael W. Mahoney, et al. Exact expressions for double descent and implicit regularization via surrogate random design, 2019, NeurIPS.

[29] R. Thomas McCoy, et al. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance, 2019, BlackboxNLP.

[30] Andrea Montanari, et al. The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve, 2019, Communications on Pure and Applied Mathematics.

[31] Philip M. Long, et al. Benign overfitting in linear regression, 2019, Proceedings of the National Academy of Sciences.

[32] Matus Telgarsky, et al. The implicit bias of gradient descent on nonseparable data, 2019, COLT.

[33] Andrea Montanari, et al. Limitations of Lazy Training of Two-layers Neural Networks, 2019, NeurIPS.

[34] Nathan Srebro, et al. Kernel and Rich Regimes in Overparametrized Models, 2019, COLT.

[35] Kaifeng Lyu, et al. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks, 2019, ICLR.

[36] Lucas Benigni, et al. Eigenvalue distribution of some nonlinear models of random matrices, 2019, Electronic Journal of Probability.

[37] Anant Sahai, et al. Harmless interpolation of noisy data in regression, 2019, IEEE International Symposium on Information Theory (ISIT).

[38] Andrea Montanari, et al. Surprises in High-Dimensional Ridgeless Least Squares Interpolation, 2019, Annals of Statistics.

[39] Mikhail Belkin, et al. Two models of double descent for weak features, 2019, SIAM J. Math. Data Sci.

[40] Benjamin Recht, et al. Do ImageNet Classifiers Generalize to ImageNet?, 2019, ICML.

[41] Mikhail Belkin, et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off, 2018, Proceedings of the National Academy of Sciences.

[42] Liwei Wang, et al. Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[43] J. Zico Kolter, et al. A Continuous-Time View of Early Stopping for Least Squares Regression, 2018, AISTATS.

[44] Barnabás Póczos, et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks, 2018, ICLR.

[45] Surya Ganguli, et al. An analytic theory of generalization dynamics and transfer learning in deep linear networks, 2018, ICLR.

[46] Andrea Montanari, et al. A mean field view of the landscape of two-layer neural networks, 2018, Proceedings of the National Academy of Sciences.

[47] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[48] Behnam Neyshabur, et al. Implicit Regularization in Deep Learning, 2017, arXiv.

[49] Nathan Srebro, et al. Exploring Generalization in Deep Learning, 2017, NIPS.

[50] Madeleine Udell, et al. Why Are Big Data Matrices Approximately Low Rank?, 2017, SIAM J. Math. Data Sci.

[51] Surya Ganguli, et al. Exponential expressivity in deep neural networks through transient chaos, 2016, NIPS.

[52] Stefan Wager, et al. High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification, 2015, arXiv:1507.03003.

[53] Honglak Lee, et al. An Analysis of Single-Layer Networks in Unsupervised Feature Learning, 2011, AISTATS.

[54] F. Götze, et al. On the Rate of Convergence to the Marchenko-Pastur Distribution, 2011, arXiv:1110.1284.

[55] Benjamin Recht, et al. Random Features for Large-Scale Kernel Machines, 2007, NIPS.

[56] Koby Crammer, et al. Analysis of Representations for Domain Adaptation, 2006, NIPS.

[57] Friedrich Götze, et al. The rate of convergence for spectra of GUE and LUE matrix ensembles, 2005.

[58] F. Götze, et al. Rate of convergence in probability to the Marchenko-Pastur law, 2004.

[59] Friedrich Götze, et al. Rate of convergence to the semi-circular law, 2003.

[60] Yimin Wei, et al. The weighted Moore-Penrose inverse of modified matrices, 2001, Appl. Math. Comput.

[61] A. Goldberger, et al. On the Exact Covariance of Products of Random Variables, 1969.

[62] V. Marčenko, et al. Distribution of Eigenvalues for Some Sets of Random Matrices, 1967.

[63] R. Nadakuditi, et al. Training Data Size Induced Double Descent For Denoising Feedforward Neural Networks and the Role of Training Noise, 2023, Trans. Mach. Learn. Res.

[64] Yunwen Lei, et al. Generalization performance of multi-pass stochastic gradient descent with convex loss functions, 2021.

[65] Nilesh Tripuraneni, et al. Overparameterization Improves Robustness to Covariate Shift in High Dimensions, 2021, NeurIPS.

[66] Surya Ganguli, et al. A theory of high dimensional regression with arbitrary correlations between input features and target functions: sample complexity, multiple descent curves and a hierarchy of phase transitions, 2021, ICML.

[67] Ji Xu, et al. On the number of variables to use in principal component regression, 2019.

[68] S. Péché, et al. A note on the Pennington-Worah distribution, 2019, Electronic Communications in Probability.

[69] Andrew Y. Ng, et al. Reading Digits in Natural Images with Unsupervised Feature Learning, 2011.

[70] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[71] Michael Elad, et al. Stable recovery of sparse overcomplete representations in the presence of noise, 2006, IEEE Transactions on Information Theory.

[72] Jian-Feng Yao, et al. Convergence Rates of Spectral Distributions of Large Sample Covariance Matrices, 2003, SIAM J. Matrix Anal. Appl.

[73] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.