Relative Flatness and Generalization in the Interpolation Regime

Traditional generalization bounds rely on bounding model capacity and therefore become vacuous in the \emph{interpolation} (over-parameterized) regime of modern machine learning, where the training data can be fit perfectly. This paper proposes a new approach to meaningful generalization bounds in the interpolation regime by decomposing the generalization gap into a notion of \emph{representativeness} and one of \emph{feature robustness}. Representativeness captures properties of the data distribution and mitigates the dependence on the data dimension by exploiting the low-dimensional feature representation implicitly used by the model. Feature robustness captures the expected change in loss under perturbations of these implicit features. We show that for models that locally minimize the training loss, feature robustness can be bounded by a relative flatness measure of the empirical loss surface. This yields an algorithm-agnostic bound that potentially explains the abundant empirical observations that flatness of the loss surface correlates with generalization.
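
To make the idea of a "relative" (scale-adjusted) flatness measure concrete, the sketch below evaluates a simple proxy of the form ||w||^2 * Tr(H) at a ridge-regression solution, where the Hessian H of the empirical loss has a closed form. This is only a hedged illustration of why flatness should be measured relative to the parameter scale; the ridge setup, variable names, and the exact form of the proxy are assumptions for the example and are not claimed to be the paper's relative-flatness measure.

```python
import numpy as np

# Hedged sketch: a scale-adjusted ("relative") flatness proxy for ridge
# regression, where the Hessian of the empirical loss is available in
# closed form. Not the paper's exact construction; illustrative only.

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 1e-2  # ridge strength (hypothetical choice for this illustration)
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Empirical loss: L(w) = (1/n) * ||X w - y||^2 + lam * ||w||^2
# Its Hessian is exact and constant in w:
H = (2.0 / n) * X.T @ X + 2.0 * lam * np.eye(d)

# Scale-adjusted flatness proxy: trace of the Hessian weighted by ||w||^2,
# so that rescaling the parameters cannot make the minimum look "flatter".
relative_flatness_proxy = np.dot(w_hat, w_hat) * np.trace(H)

print(f"trace(H)          = {np.trace(H):.4f}")
print(f"||w||^2           = {np.dot(w_hat, w_hat):.4f}")
print(f"relative flatness = {relative_flatness_proxy:.4f}")
```

The weighting by ||w||^2 reflects the motivation behind relative flatness: raw Hessian-based sharpness can be changed arbitrarily by reparameterizations that rescale the weights, so a meaningful flatness measure must account for the parameter scale.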
