Out-of-Distribution Generalization in Kernel Regression

In real-world applications, the data-generating process used to train a machine learning model often differs from the one the model encounters at test time. Understanding whether and how machine learning models generalize under such distribution shifts remains a theoretical challenge. Here, we study generalization in kernel regression when the training and test distributions differ, using the replica method from statistical physics. We derive an analytical formula for the out-of-distribution generalization error that applies to any kernel and to real datasets. We identify an overlap matrix, which quantifies the mismatch between the two distributions for a given kernel, as a key determinant of generalization performance under distribution shift. Using our analytical expressions, we elucidate various generalization phenomena, including the possibility that a mismatch improves generalization. We develop procedures for optimizing the training and test distributions under a given data budget to find best- and worst-case generalization under shift. We present applications of our theory to real and synthetic datasets and to many kernels, compare its predictions for the Neural Tangent Kernel with simulations of wide neural networks and find agreement, and analyze linear regression in further depth.
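The setting described above can be illustrated with a minimal numerical sketch. The snippet below is not the paper's analytical theory; it is an illustrative simulation of the linear-regression special case, where a ridge estimator is fit on inputs drawn from one Gaussian distribution and evaluated on another. The teacher vector `w_star`, the covariance choices, and the sample sizes are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, lam = 20, 40, 1e-3  # input dimension, training samples, ridge parameter

# Hypothetical linear teacher: targets are y = x . w_star (noiseless)
w_star = rng.standard_normal(d) / np.sqrt(d)

def gen_error(train_cov, test_cov, n, n_test=5000):
    """Fit ridge regression on n samples from N(0, train_cov) and
    return the mean squared error on samples from N(0, test_cov)."""
    X = rng.multivariate_normal(np.zeros(d), train_cov, size=n)
    y = X @ w_star
    # Ridge estimator: w = (X^T X + lam I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    X_test = rng.multivariate_normal(np.zeros(d), test_cov, size=n_test)
    return float(np.mean((X_test @ w - X_test @ w_star) ** 2))

iso = np.eye(d)                               # training distribution
aniso = np.diag(np.linspace(0.1, 2.0, d))     # shifted test distribution

e_matched = gen_error(iso, iso, n_train)      # train and test distributions match
e_shifted = gen_error(iso, aniso, n_train)    # out-of-distribution test error
```

Comparing `e_matched` and `e_shifted` across choices of test covariance shows how the alignment between the two input distributions (the role played by the overlap matrix in the theory) controls the error under shift; depending on the covariances, a mismatch can either hurt or help.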
