The Deep Bootstrap: Good Online Learners are Good Offline Generalizers

We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, the problem of generalization in offline learning reduces to the problem of optimization in online learning. We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs on image distributions in the Real World, but this is "because" they optimize faster on the population loss in the Ideal World. This suggests our framework is a useful tool for understanding generalization in deep learning, and it lays a foundation for future research in the area.
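
In symbols, write f_t for the Real World model after t optimizer steps on the empirical loss of a fixed training set, and f_t^ideal for the Ideal World model after t steps on the population loss (a fresh sample at every step). The decomposition above is then

    TestError(f_t) = TestError(f_t^ideal) + [ TestError(f_t) - TestError(f_t^ideal) ],

where the bracketed term is the gap between the two worlds.

The sketch below (Python/PyTorch) shows one way the two worlds can be coupled: both models share an initialization and optimizer, but the Real World resamples minibatches from a fixed training set while the Ideal World draws a fresh batch at every step. It assumes a synthetic toy distribution so that fresh population samples are cheap; the sample_batch helper, the small MLP, and the hyperparameters are illustrative stand-ins, not the paper's experimental setup.

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def sample_batch(n, d=20):
    """Draw n fresh labeled examples from a toy population distribution."""
    x = torch.randn(n, d)
    y = (x[:, 0] + 0.5 * x[:, 1] > 0).long()  # simple synthetic labeling rule
    return x, y

# Finite training set for the Real World; the Ideal World never reuses data.
train_x, train_y = sample_batch(5000)
test_x, test_y = sample_batch(20000)

def make_model(d=20):
    return nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 2))

real = make_model()
ideal = copy.deepcopy(real)                 # identical initialization
opt_real = torch.optim.SGD(real.parameters(), lr=0.1)
opt_ideal = torch.optim.SGD(ideal.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def test_error(model):
    with torch.no_grad():
        pred = model(test_x).argmax(dim=1)
    return (pred != test_y).float().mean().item()

batch = 128
for step in range(1, 2001):
    # Real World: SGD on the empirical loss (minibatches from the fixed train set).
    idx = torch.randint(0, len(train_x), (batch,))
    opt_real.zero_grad()
    loss_fn(real(train_x[idx]), train_y[idx]).backward()
    opt_real.step()

    # Ideal World: SGD on the population loss (a fresh batch every step).
    fresh_x, fresh_y = sample_batch(batch)
    opt_ideal.zero_grad()
    loss_fn(ideal(fresh_x), fresh_y).backward()
    opt_ideal.step()

    if step % 500 == 0:
        err_real, err_ideal = test_error(real), test_error(ideal)
        print(f"step {step}: real {err_real:.3f}  ideal {err_ideal:.3f}  "
              f"gap {err_real - err_ideal:.3f}")

If the printed gap stays small while both errors decrease, the Real World learner's test error is essentially tracking its online (Ideal World) optimization curve, which is the sense in which the framework reduces offline generalization to online optimization.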
