Comparing dynamics: deep neural networks versus glassy systems

We numerically analyze the training dynamics of deep neural networks (DNNs) using methods developed in the statistical physics of glassy systems. We address two main issues: (1) the complexity of the loss landscape and of the dynamics within it, and (2) the extent to which DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that during training the dynamics slows down because of an increasingly large number of flat directions. At long times, as the loss approaches zero, the system diffuses at the bottom of the landscape. Despite some similarities with the dynamics of mean-field glassy systems, in particular the absence of barrier crossing, we find distinctive dynamical behaviors in the two cases, showing that the statistical properties of the corresponding loss and energy landscapes differ. In contrast, when the network is under-parametrized we observe typical glassy behavior, suggesting the existence of different phases depending on whether the network is under- or over-parametrized.
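
The two-time observables central to this kind of analysis are straightforward to instrument in a training loop. Below is a minimal sketch (not the authors' code) of how one might track the mean-squared displacement of the parameters, Delta(t_w, t) = (1/N) ||theta(t_w + t) - theta(t_w)||^2, for several waiting times t_w during SGD; the model, synthetic data, and hyperparameters are placeholders chosen purely for illustration. In an aging, glassy regime Delta depends on both t_w and t, whereas plain diffusion at the bottom of the landscape shows up as growth of Delta that is roughly independent of the waiting time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder task: random binary classification (illustration only).
X = torch.randn(512, 32)
y = torch.randint(0, 2, (512,))

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def flat_params(m):
    """Concatenate all parameters into a single 1-D tensor."""
    return torch.cat([p.detach().reshape(-1) for p in m.parameters()])

waiting_times = [10, 100, 1000]      # t_w: steps at which to snapshot theta(t_w)
snapshots = {}                       # t_w -> reference parameter vector
history = {tw: [] for tw in waiting_times}  # t_w -> list of (t, Delta)

for step in range(1, 2001):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

    if step in waiting_times:
        snapshots[step] = flat_params(model).clone()

    # Two-time mean-squared displacement per parameter, Delta(t_w, t),
    # measured against every snapshot taken so far.
    theta = flat_params(model)
    for tw, ref in snapshots.items():
        delta = ((theta - ref) ** 2).mean().item()
        history[tw].append((step - tw, delta))

for tw in waiting_times:
    print(f"t_w={tw}: final Delta = {history[tw][-1][1]:.3e}")
```

Plotting history[tw] on log-log axes for the different waiting times then makes the presence or absence of aging, i.e. an explicit t_w dependence of Delta, directly visible.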
