Generalisation dynamics of online learning in over-parameterised neural networks

Deep neural networks achieve stellar generalisation on a variety of problems, despite often being large enough to easily fit all their training data. Here we study the generalisation dynamics of two-layer neural networks in a teacher-student setup, where one network, the student, is trained using stochastic gradient descent (SGD) on data generated by another network, the teacher. We show how, for this problem, the dynamics of SGD are captured by a set of differential equations. In particular, we demonstrate analytically that the generalisation error of the student increases linearly with the network size, with all other relevant parameters held constant. Our results indicate that achieving good generalisation in neural networks depends on the interplay of at least the algorithm, its learning rate, the model architecture, and the data set.
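
To make the setup concrete, the sketch below is a minimal illustration (not the authors' code): it trains an over-parameterised two-layer student by online SGD on Gaussian inputs labelled by a fixed two-layer teacher and estimates the generalisation error on fresh samples. The network sizes `N`, `M`, `K`, the tanh activation, the mean-over-hidden-units output convention, and the learning rate `lr` are all illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the teacher-student online-SGD setup described in the abstract.
# All sizes and hyper-parameters below are illustrative, not the paper's values.

rng = np.random.default_rng(0)

N = 500          # input dimension
M = 2            # teacher hidden units
K = 8            # student hidden units (K > M: over-parameterised)
lr = 0.5         # SGD learning rate (illustrative)
steps = 100_000  # online SGD steps, one fresh sample per step

def g(x):
    # sigmoidal activation; tanh is an assumption for this sketch
    return np.tanh(x)

# Fixed teacher: only first-layer weights, output averaged over hidden units.
W_teacher = rng.standard_normal((M, N))

def teacher(x):
    return g(W_teacher @ x / np.sqrt(N)).mean()

# Student with the same architecture but K hidden units; only the first layer
# is trained here, the simplest variant of the setup.
W_student = rng.standard_normal((K, N))

def student(x, W):
    return g(W @ x / np.sqrt(N)).mean()

def generalisation_error(W, n_test=2000):
    # Monte-Carlo estimate of 0.5 * E[(student(x) - teacher(x))^2] on fresh inputs.
    xs = rng.standard_normal((n_test, N))
    errs = [(student(x, W) - teacher(x)) ** 2 for x in xs]
    return 0.5 * float(np.mean(errs))

for t in range(steps):
    x = rng.standard_normal(N)                # fresh i.i.d. Gaussian input (online learning)
    pre = W_student @ x / np.sqrt(N)          # student pre-activations
    err = student(x, W_student) - teacher(x)  # prediction error on this sample
    # Gradient of 0.5 * err^2 w.r.t. the first-layer weights; the 1/K factor
    # comes from averaging over the student's hidden units.
    grad = (err / K) * (1.0 - np.tanh(pre) ** 2)[:, None] * (x[None, :] / np.sqrt(N))
    W_student -= lr * grad

print(f"final generalisation error ~ {generalisation_error(W_student):.4f}")
```

Sweeping the number of student hidden units `K` at fixed teacher size, input dimension, and learning rate in such a simulation is one way to probe numerically how the final generalisation error depends on the network size.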
