Fractional moment-preserving initialization schemes for training fully-connected neural networks

A traditional approach to initialization in deep neural networks (DNNs) is to sample the network weights randomly so that the variance of the pre-activations is preserved across layers. On the other hand, several studies show that during training the distribution of stochastic gradients can be heavy-tailed, especially for small batch sizes. In this case, the weights, and therefore the pre-activations, can be modeled by a heavy-tailed distribution that has infinite variance but a finite (non-integer) fractional moment of order $s$ with $s < 2$. Motivated by this fact, we develop initialization schemes for fully connected feed-forward networks that provably preserve any given moment of order $s \in (0, 2]$ over the layers for a class of activations including ReLU, Leaky ReLU, Randomized Leaky ReLU, and linear activations. These generalized schemes recover traditional initialization schemes in the limit $s \to 2$ and serve as part of a principled theory of initialization. For all of these schemes, we show that the network output admits a finite almost sure limit as the number of layers grows, and that the limit is heavy-tailed in some settings. This sheds further light on the origins of heavy tails during signal propagation in DNNs. We prove that the logarithm of the norm of the network output, properly scaled, converges to a Gaussian distribution whose mean and variance we can compute explicitly as functions of the chosen activation, the value of $s$, and the network width. We also prove that our initialization schemes avoid small network output values more frequently than traditional approaches. Furthermore, the proposed initialization strategy incurs no extra cost during training. Finally, we show through numerical experiments that our initialization can improve training and test performance.
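To illustrate what a fractional moment-preserving scheme of this kind can look like in practice, the sketch below chooses the standard deviation of zero-mean Gaussian weights for a fully connected ReLU layer of width $n$ so that the expected $s$-th moment of the Euclidean norm of the post-activations is mapped to itself from one layer to the next. Under this Gaussian-weight assumption the normalizing constant is a binomial mixture of chi-square moments (a sum of ratios of gamma functions), and for $s = 2$ the rule reduces to the familiar He scale $\sqrt{2/n}$. This is a simplified sketch under the stated assumptions, not the paper's exact scheme or constants; function names such as `fractional_init_scale` and `relu_moment_constant` are illustrative choices.

```python
# Minimal sketch of a fractional moment-preserving initializer for a fully
# connected ReLU layer with zero-mean Gaussian weights. The scale sigma is
# chosen so that sigma^s * rho(fan_in, s) = 1, where
#   rho(n, s) = E[ (sum_i ReLU(g_i)^2)^(s/2) ],  g ~ N(0, I_n),
# which (conditioning on the previous layer) maps the expected s-th moment
# of ||ReLU(pre-activations)||_2 to itself across layers.
import numpy as np
from scipy.special import gammaln


def relu_moment_constant(n, s, num_samples=200_000, rng=None):
    """Monte-Carlo estimate of rho(n, s); for s = 2 it equals n / 2 exactly."""
    rng = np.random.default_rng(rng)
    g = rng.standard_normal((num_samples, n))
    return np.mean(np.sum(np.maximum(g, 0.0) ** 2, axis=1) ** (s / 2))


def relu_moment_constant_exact(n, s):
    """Closed form for rho(n, s): condition on the number k of positive
    coordinates (Binomial(n, 1/2)) and use chi-square moments,
    rho(n, s) = sum_k C(n, k) 2^{-n} * 2^{s/2} Gamma((k+s)/2) / Gamma(k/2)."""
    k = np.arange(1, n + 1)  # the k = 0 term contributes zero
    log_terms = (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
                 - n * np.log(2.0) + (s / 2) * np.log(2.0)
                 + gammaln((k + s) / 2) - gammaln(k / 2))
    return np.exp(log_terms).sum()


def fractional_init_scale(fan_in, s):
    """Weight standard deviation sigma solving sigma^s * rho(fan_in, s) = 1."""
    return relu_moment_constant_exact(fan_in, s) ** (-1.0 / s)


def init_layer(fan_in, fan_out, s, rng=None):
    """Sample one fully connected weight matrix with the moment-preserving scale."""
    rng = np.random.default_rng(rng)
    return fractional_init_scale(fan_in, s) * rng.standard_normal((fan_out, fan_in))


if __name__ == "__main__":
    n = 256
    # Sanity check: for s = 2 the scale reduces to He initialization sqrt(2 / n).
    print(fractional_init_scale(n, 2.0), np.sqrt(2.0 / n))
    # A heavier-tailed choice, e.g. s = 1.5, and a Monte-Carlo cross-check.
    print(fractional_init_scale(n, 1.5))
    print(relu_moment_constant(n, 1.5), relu_moment_constant_exact(n, 1.5))
```

For other activations in the class considered here (e.g. Leaky ReLU with slope $a$), the same recipe applies with ReLU replaced by the corresponding activation inside the normalizing constant; for $s = 2$ this yields the known variant $\sigma^2 = 2 / (n(1 + a^2))$ of the He scale.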
