The Loss Surfaces of Multilayer Networks

We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits behavior similar to that of the computer simulations, despite the presence of strong dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality as measured by the test error. This emphasizes a major difference between large- and small-size networks, where for the latter poor-quality local minima have nonzero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases, and that it is in practice irrelevant, as the global minimum often leads to overfitting.
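For context, the spin-glass object referred to above is the H-spin spherical spin-glass Hamiltonian studied in the random matrix literature (e.g. Auffinger et al.). The display below is a sketch of its standard form; the identification of H with the number of network layers and the particular normalization are stated here as assumptions consistent with that literature, not quoted from the paper:

H_{N,H}(w) \;=\; \frac{1}{N^{(H-1)/2}} \sum_{i_1,\ldots,i_H=1}^{N} X_{i_1,\ldots,i_H}\, w_{i_1} w_{i_2} \cdots w_{i_H},
\qquad \|w\|_2^2 = N,

where the couplings X_{i_1,\ldots,i_H} are i.i.d. standard Gaussian random variables and w is constrained to the sphere of radius \sqrt{N}. Under the independence, redundancy, and uniformity assumptions, the network loss is analyzed through the critical-point structure of this random function, which is where the layered band of low-lying local minima arises.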
