Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior

We describe an approach to understanding the peculiar and counterintuitive generalization properties of deep neural networks. The approach involves going beyond the worst-case capacity-control frameworks that have been popular in machine learning in recent years and revisiting older ideas from the statistical mechanics of neural networks. Within this approach, we present a prototypical Very Simple Deep Learning (VSDL) model whose behavior is governed by two control parameters: one describing the effective amount of data, or load, on the network (which decreases when noise is added to the input), and one with an effective temperature interpretation (which increases when learning algorithms are stopped early). Using this model, we describe how a straightforward application of ideas from the statistical mechanics theory of generalization provides a strong qualitative account of recently observed empirical results, including the inability of deep neural networks to avoid overfitting training data, as well as discontinuous learning and sharp transitions in the generalization properties of learning algorithms.
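To make the two control parameters concrete, the following is a minimal, hypothetical sketch (not the paper's VSDL model or experiments) of how one might sweep them on a toy teacher-student problem: label noise on the training set stands in for lowering the effective load on the network, and the number of gradient-descent epochs stands in for an effective temperature knob via early stopping. The teacher/logistic-regression setup, function names, and parameter values below are illustrative assumptions; in the statistical-mechanics picture one would watch how train and test accuracy change as these two knobs are swept, looking for sharp transitions rather than smooth worst-case bounds.

```python
# Toy sketch (assumed setup, not from the paper): sweep a "load" knob
# (label-noise fraction) and a "temperature" knob (early-stopping epoch count)
# for a simple logistic-regression student trained on a linear teacher.
import numpy as np

rng = np.random.default_rng(0)
d = 50
w_true = rng.normal(size=d)  # teacher weights shared by train and test sets

def make_data(n, noise_frac):
    """Teacher-labeled Gaussian inputs; a fraction of labels is flipped,
    which lowers the effective load (clean data per parameter)."""
    X = rng.normal(size=(n, d))
    y = np.sign(X @ w_true)
    flip = rng.random(n) < noise_frac
    y[flip] *= -1.0
    return X, y

def train(X, y, epochs, lr=0.5):
    """Plain full-batch gradient descent on the logistic loss; stopping after
    fewer epochs plays the role of raising an effective temperature."""
    w = np.zeros(d)
    for _ in range(epochs):
        margins = np.clip(y * (X @ w), -30.0, 30.0)
        # gradient of the mean logistic loss log(1 + exp(-y * w.x))
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

def accuracy(w, X, y):
    return float(np.mean(np.sign(X @ w) == y))

X_test, y_test = make_data(2000, noise_frac=0.0)

for noise in (0.0, 0.2, 0.4):      # knob 1: more label noise ~ lower effective load
    for epochs in (5, 50, 500):    # knob 2: earlier stopping ~ higher effective temperature
        X_tr, y_tr = make_data(200, noise_frac=noise)
        w = train(X_tr, y_tr, epochs)
        print(f"noise={noise:.1f}  epochs={epochs:4d}  "
              f"train_acc={accuracy(w, X_tr, y_tr):.2f}  "
              f"test_acc={accuracy(w, X_test, y_test):.2f}")
```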
