Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is $O(1)$ is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values of the Jacobian concentrate near $1$ is a property known as dynamical isometry. For deep linear networks, dynamical isometry can be achieved through orthogonal weight initialization and has been shown to dramatically speed up learning; however, it has remained unclear how to extend these results to the nonlinear setting. We address this question by employing powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve dynamical isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not. Indeed, we show that properly initialized deep sigmoidal networks consistently outperform deep ReLU networks. Overall, our analysis reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.
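To make the central quantity concrete, the following NumPy sketch (an illustrative example written for this summary, not code from the paper; the width, depth, gain, and input scale are arbitrary choices) forms the input-output Jacobian of a deep tanh network as the product of per-layer Jacobians $D_l W_l$, with $D_l$ the diagonal matrix of activation derivatives, and compares its singular values under orthogonal versus scaled Gaussian initialization.

```python
import numpy as np

def layer_weight(n, init, gain, rng):
    """n x n weight matrix: 'orth' draws a random orthogonal matrix,
    'gauss' draws i.i.d. Gaussian entries with variance gain**2 / n."""
    if init == "orth":
        q, _ = np.linalg.qr(rng.standard_normal((n, n)))
        return gain * q
    return gain * rng.standard_normal((n, n)) / np.sqrt(n)

def jacobian_singular_values(depth=32, width=256, init="orth", gain=1.0, seed=0):
    """Propagate one input through a deep tanh network (no biases) and
    accumulate the input-output Jacobian J = prod_l D_l W_l, where
    D_l = diag(tanh'(h_l)) holds the activation derivatives of layer l.
    Returns the singular values of J."""
    rng = np.random.default_rng(seed)
    x = 0.1 * rng.standard_normal(width)          # small input keeps activations near the linear regime
    J = np.eye(width)
    for _ in range(depth):
        W = layer_weight(width, init, gain, rng)
        h = W @ x                                  # pre-activations of this layer
        J = np.diag(1.0 - np.tanh(h) ** 2) @ W @ J # chain rule: J <- D_l W_l J
        x = np.tanh(h)                             # post-activations feed the next layer
    return np.linalg.svd(J, compute_uv=False)

for init in ("orth", "gauss"):
    sv = jacobian_singular_values(init=init)
    print(f"{init:5s}  mean sq = {np.mean(sv**2):.3f}  "
          f"max = {sv.max():.3f}  min = {sv.min():.3f}")
```

In this toy setup one should see the orthogonal spectrum stay tightly clustered near $1$ while the Gaussian spectrum spreads over orders of magnitude as depth grows; the paper makes this comparison precise using free probability theory.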
