Harmless interpolation in regression and classification with structured features

Overparameterized neural networks tend to fit noisy training data perfectly yet still generalize well on test data. Inspired by this empirical observation, recent work has sought to understand this phenomenon of benign overfitting, or harmless interpolation, in the much simpler linear model. Previous theoretical work critically assumes either that the data features are statistically independent or that the input data is high-dimensional; this precludes general nonparametric settings with structured feature maps. In this paper, we present a general and flexible framework for upper bounding regression and classification risk in a reproducing kernel Hilbert space. A key contribution is that our framework describes precise sufficient conditions on the data Gram matrix under which harmless interpolation occurs. Our results recover prior independent-features results with a much simpler analysis, but they also show that harmless interpolation can occur in more general settings, such as features that form a bounded orthonormal system. Moreover, our results establish an asymptotic separation between classification and regression performance in a manner that was previously shown only for Gaussian features.
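To make the object of study concrete, the following is a minimal numerical sketch of minimum-norm interpolation in an RKHS, i.e. kernel "ridgeless" regression, whose behavior is governed by the data Gram matrix discussed above. This is our own illustration rather than the paper's construction: the kernel, data-generating process, and parameter choices below are arbitrary assumptions for demonstration purposes only.

```python
import numpy as np

# Sketch of minimum-norm RKHS interpolation (kernel "ridgeless" regression):
# the interpolant is f(x) = k(x, X) @ K^{-1} y, where K is the data Gram matrix,
# the object whose spectral properties determine whether interpolation is harmless.

rng = np.random.default_rng(0)

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

# Illustrative data: noisy labels from a smooth target (choices are arbitrary).
n, d, noise_sd = 200, 20, 0.5
X = rng.standard_normal((n, d)) / np.sqrt(d)
f_star = lambda Z: np.sin(Z.sum(axis=1))
y = f_star(X) + noise_sd * rng.standard_normal(n)

# Minimum-norm interpolant: alpha = K^{-1} y (solve the linear system directly).
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K, y)

# The interpolant fits the noisy training labels exactly (up to numerical error)...
train_mse = np.mean((K @ alpha - y) ** 2)

# ...yet its test error against the clean target can still be small,
# depending on the spectrum of the Gram matrix K.
X_test = rng.standard_normal((1000, d)) / np.sqrt(d)
test_mse = np.mean((rbf_kernel(X_test, X) @ alpha - f_star(X_test)) ** 2)

print(f"train MSE vs noisy labels: {train_mse:.2e}")
print(f"test MSE vs clean target:  {test_mse:.3f}")
```

Running this sketch shows near-zero training error on the noisy labels alongside a modest test error against the noiseless target; the paper's contribution is to characterize, via conditions on K, when this kind of interpolation is provably harmless.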
