Interpolation and Learning with Scale Dependent Kernels

We study the learning properties of nonparametric ridgeless least squares. In particular, we consider the common case of estimators defined by scale-dependent kernels and focus on the role of the scale. These estimators interpolate the data, and the scale can be shown to control their stability through the condition number of the kernel matrix. Our analysis shows that there are different regimes depending on the interplay between the sample size, the data dimension, and the smoothness of the problem. Indeed, when the sample size is less than exponential in the data dimension, the scale can be chosen so that the learning error decreases. As the sample size becomes larger, the overall error stops decreasing, but interestingly the scale can be chosen so that the variance due to noise remains bounded. Our analysis combines probabilistic results with a number of analytic techniques from interpolation theory.
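To make the setting concrete, here is a minimal sketch, assuming a Gaussian kernel and NumPy, of the ridgeless (minimum-norm interpolating) kernel estimator and of how the kernel scale drives the condition number of the kernel matrix. All concrete choices below (sample size n, dimension d, the target function, the scale values) are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch: ridgeless kernel regression interpolates the data,
# and the kernel scale sigma controls the conditioning of the kernel matrix.
import numpy as np

def gaussian_kernel(X, Z, sigma):
    """Gaussian kernel matrix k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
n, d = 200, 10                                        # illustrative sample size and dimension
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)    # illustrative noisy target

for sigma in [0.5, 2.0, 8.0]:
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K, y)                         # ridgeless solution: c = K^{-1} y
    train_residual = np.max(np.abs(K @ c - y))        # ~0 up to numerical precision: interpolation
    print(f"sigma={sigma:4.1f}  cond(K)={np.linalg.cond(K):.2e}  "
          f"max train residual={train_residual:.1e}")
```

In this sketch, a small scale makes the kernel matrix close to the identity and hence well conditioned, while a large scale pushes it toward a rank-one matrix and hence toward numerical instability; this is the kind of stability effect that the condition number is meant to track in the analysis above.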
