Kernel regression in high dimension: Refined analysis beyond double descent

In this paper, we provide a precise characterization of the generalization properties of high-dimensional kernel ridge regression across the under- and over-parameterized regimes, depending on whether the number of training samples $n$ exceeds the feature dimension $d$. By establishing a novel bias-variance decomposition of the expected excess risk, we show that, while the bias is independent of $d$ and decreases monotonically with $n$, the variance depends on both $n$ and $d$ and can be unimodal or monotonically decreasing under different regularization schemes. Our refined analysis goes beyond double descent theory by showing that, depending on the data eigen-profile and the level of regularization, the kernel regression risk curve can be a double-descent-like, bell-shaped, or monotonic function of $n$. Experiments on synthetic and real data support our theoretical findings.
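To make the quantities above concrete, the following Python sketch empirically estimates the squared bias and variance of kernel ridge regression over repeated draws of the training set, so the risk curve can be traced as a function of $n$ at fixed $d$. This is a minimal illustration with assumed choices (RBF kernel, a linear teacher w_star, noise level sigma_noise, ridge parameter lam), not the paper's exact decomposition or experimental setup.

```python
# Minimal synthetic sketch (assumed setup, not the paper's experiments):
# estimate squared bias and variance of kernel ridge regression (KRR)
# over repeated training-set draws, for growing n at fixed d.
import numpy as np

rng = np.random.default_rng(0)
d, lam, sigma_noise, n_trials = 20, 1e-3, 0.5, 50
n_grid = [10, 20, 40, 80, 160, 320]

w_star = rng.standard_normal(d) / np.sqrt(d)      # linear teacher, for simplicity
f_star = lambda X: X @ w_star

def rbf_kernel(A, B, gamma):
    # Pairwise squared distances, then Gaussian kernel.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

gamma = 1.0 / d                                    # bandwidth scaling with dimension
X_test = rng.standard_normal((500, d))
y_test_clean = f_star(X_test)

for n in n_grid:
    preds = np.zeros((n_trials, len(X_test)))
    for t in range(n_trials):
        X = rng.standard_normal((n, d))
        y = f_star(X) + sigma_noise * rng.standard_normal(n)
        K = rbf_kernel(X, X, gamma)
        alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # KRR dual solution
        preds[t] = rbf_kernel(X_test, X, gamma) @ alpha
    mean_pred = preds.mean(0)
    bias2 = np.mean((mean_pred - y_test_clean) ** 2)          # squared bias
    var = np.mean(preds.var(0))                               # variance over training sets
    print(f"n={n:4d}  bias^2={bias2:.4f}  var={var:.4f}  risk={bias2 + var:.4f}")
```

Varying lam (and the data eigen-profile, e.g. via anisotropic covariates) in such a simulation is one way to observe the monotonic, bell-shaped, or double-descent-like risk curves described above.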
