Dimensionality reduction, regularization, and generalization in overparameterized regressions

Overparameterization in deep learning is powerful: very large models fit the training data perfectly and yet generalize well. This realization revived the study of linear regression models, including ordinary least squares (OLS), which, like deep learning, exhibits "double descent" behavior. This behavior has two features: (1) the risk (out-of-sample prediction error) can grow arbitrarily as the number of samples $n$ approaches the number of parameters $p$, and (2) the risk decreases with $p$ when $p>n$, sometimes reaching a lower value than the lowest risk attained at $p<n$. The divergence of the OLS risk at $p\approx n$ is related to the condition number of the empirical covariance of the features; for this reason, it can be avoided with regularization. In this work we show that it can also be avoided with a PCA-based dimensionality reduction, and we provide a finite upper bound on the risk of the PCA-based estimator. This result contrasts with recent work showing that a different form of dimensionality reduction, one based on the population covariance instead of the empirical covariance, does not avoid the divergence. We connect these results to an analysis of adversarial attacks, which become more effective as they raise the condition number of the empirical covariance of the features. We show that OLS is arbitrarily susceptible to data-poisoning attacks in the overparameterized regime, unlike the underparameterized regime, and that regularization and dimensionality reduction improve robustness.
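The qualitative picture above can be illustrated numerically. The sketch below is a minimal simulation, not the paper's experiment: it sweeps the number of features $p$ through the sample size $n$ and compares minimum-norm OLS with ridge regression and with a PCA-based estimator that regresses on the leading principal directions of the empirical covariance. The Gaussian data model, the ridge penalty `lam`, and the component cut-off `k` are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (illustrative assumptions, not the paper's setup) of the OLS
# risk peak at p ~ n and its mitigation by ridge regularization or by a
# PCA-based dimensionality reduction computed from the empirical covariance.
import numpy as np

rng = np.random.default_rng(0)
n, p_total, n_test = 40, 120, 2000                  # samples, max parameters, test points
beta = rng.normal(size=p_total) / np.sqrt(p_total)  # ground-truth coefficients
sigma = 0.5                                         # label-noise standard deviation

X_full = rng.normal(size=(n, p_total))
y = X_full @ beta + sigma * rng.normal(size=n)
X_test_full = rng.normal(size=(n_test, p_total))
y_test = X_test_full @ beta                         # noiseless targets for the risk estimate

def risk(y_hat):
    return np.mean((y_hat - y_test) ** 2)

for p in [10, 30, 38, 40, 42, 60, 120]:             # sweep model size through p ~ n
    X, X_test = X_full[:, :p], X_test_full[:, :p]

    # Minimum-norm OLS: pinv covers both the under- and overparameterized cases.
    b_ols = np.linalg.pinv(X) @ y

    # Ridge regression with a small fixed penalty (illustrative value).
    lam = 1e-1
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # PCA-based estimator: project onto the top-k right singular vectors of X
    # (leading eigendirections of the uncentered empirical covariance), regress
    # in that subspace, then map back to coefficient space.
    k = min(p, n) // 2                               # illustrative cut-off
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T
    b_pcr = V_k @ (np.linalg.pinv(X @ V_k) @ y)

    print(f"p={p:4d}  OLS risk={risk(X_test @ b_ols):8.3f}  "
          f"ridge={risk(X_test @ b_ridge):7.3f}  PCR={risk(X_test @ b_pcr):7.3f}")
```

Near $p = n$ the smallest singular value of the design matrix becomes tiny, so the empirical covariance is ill-conditioned and the OLS risk spikes, while the ridge and PCA-based estimators stay bounded; the exact numbers depend on the random seed and the chosen constants.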
