How rotational invariance of common kernels prevents generalization in high dimensions

Kernel ridge regression is well known to achieve minimax optimal rates in low-dimensional settings. However, its behavior in high dimensions is much less understood. Recent work establishes consistency for high-dimensional kernel regression under a number of specific assumptions on the data distribution. In this paper, we show that in high dimensions, the rotational invariance property of commonly studied kernels (such as RBF kernels, inner product kernels, and the fully-connected NTK of any depth) leads to inconsistent estimation unless the ground truth is a low-degree polynomial. Our lower bound on the generalization error holds for a wide range of distributions and kernels with different eigenvalue decays. This lower bound suggests that consistency results for kernel ridge regression in high dimensions generally require a more refined analysis that depends on the structure of the kernel beyond its eigenvalue decay.
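
To make the setting concrete, below is a minimal sketch (not taken from the paper) of kernel ridge regression with a rotationally invariant RBF kernel in a regime where the dimension d is comparable to the sample size n. All constants (n, d, the bandwidth, the ridge parameter, and the toy ground-truth function) are illustrative assumptions, chosen only to show the estimator's form.

```python
import numpy as np

# Minimal kernel ridge regression with a rotationally invariant (RBF) kernel.
# All constants (n, d, bandwidth, ridge lambda) are illustrative, not from the paper.

def rbf_kernel(X, Z, bandwidth=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * bandwidth^2)) depends on (x, z) only
    # through ||x - z||, hence is invariant under a simultaneous rotation of x and z.
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2 * X @ Z.T
    )
    return np.exp(-sq_dists / (2 * bandwidth**2))

def kernel_ridge_fit(X, y, lam=1e-3, bandwidth=1.0):
    # Solve (K + n * lam * I) alpha = y for the representer coefficients.
    n = X.shape[0]
    K = rbf_kernel(X, X, bandwidth)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(X_train, alpha, X_test, bandwidth=1.0):
    return rbf_kernel(X_test, X_train, bandwidth) @ alpha

# Toy high-dimensional example: n and d of comparable size.
rng = np.random.default_rng(0)
n, d = 500, 200
X = rng.standard_normal((n, d)) / np.sqrt(d)
f_star = lambda x: np.tanh(x @ np.ones(d))       # nonlinear ground truth (illustrative)
y = f_star(X) + 0.1 * rng.standard_normal(n)

alpha = kernel_ridge_fit(X, y)
X_test = rng.standard_normal((200, d)) / np.sqrt(d)
y_pred = kernel_ridge_predict(X, alpha, X_test)
print(f"test MSE: {np.mean((y_pred - f_star(X_test))**2):.4f}")
```

The kernel above is a function of the Euclidean distance alone, which is exactly the rotational invariance the paper's lower bound exploits: in the proportional regime, such an estimator can only recover the low-degree polynomial part of f_star.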
