Distributional Generalization: A New Kind of Generalization

We introduce a new notion of generalization, Distributional Generalization, which roughly states that the outputs of a classifier at train and test time are close *as distributions*, rather than merely close in average error. For example, if we mislabel 30% of dogs as cats in the CIFAR-10 train set, then a ResNet trained to interpolation will mislabel roughly 30% of dogs as cats on the *test set* as well, while leaving the other classes unaffected. This behavior is not captured by classical generalization, which considers only the average error and not the distribution of errors over the input domain. Our formal conjectures, which are far more general than this example, characterize the form of distributional generalization that can be expected in terms of problem parameters: model architecture, training procedure, number of samples, and data distribution. We give empirical evidence for these conjectures across a variety of domains in machine learning, including neural networks, kernel machines, and decision trees. Our results thus advance the empirical understanding of interpolating classifiers.
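As a concrete illustration of the headline experiment, the sketch below injects 30% dog-to-cat label noise into the CIFAR-10 train set, trains a ResNet to interpolation, and then measures how often *test* dogs are predicted as cats. It assumes a PyTorch/torchvision setup; the ResNet-18 architecture, optimizer settings, and epoch count are illustrative choices rather than the paper's exact configuration (CIFAR-10's standard class indices are cat = 3, dog = 5).

```python
# Minimal sketch (assumed PyTorch/torchvision setup): train a ResNet on CIFAR-10
# with 30% of dogs relabeled as cats, then measure the dog->cat rate on the test set.
# Architecture, optimizer, and epoch count are illustrative, not the paper's exact setup.
import random

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

CAT, DOG = 3, 5      # standard CIFAR-10 class indices
FLIP_PROB = 0.30     # fraction of training dogs relabeled as cats

transform = T.ToTensor()
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=transform)

# Inject structured label noise: flip ~30% of training dogs to cats.
rng = random.Random(0)
train_set.targets = [
    CAT if (y == DOG and rng.random() < FLIP_PROB) else y for y in train_set.targets
]

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet18(num_classes=10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

# Train long enough to interpolate, i.e. fit the noisy train labels (near-)exactly.
model.train()
for epoch in range(100):
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# Distributional generalization: how often are *test* dogs predicted as cats?
model.eval()
dogs, dogs_as_cats = 0, 0
with torch.no_grad():
    for x, y in test_loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        mask = y == DOG
        dogs += mask.sum().item()
        dogs_as_cats += (pred[mask] == CAT).sum().item()
print(f"test dogs predicted as cats: {dogs_as_cats / dogs:.2%}")  # expected: roughly 30%
```

If the conjecture holds, the printed fraction should land near the 30% train-time flip rate, while the error on the remaining classes stays close to its clean-label level.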
