Exploring Learning Dynamics of DNNs via Layerwise Conditioning Analysis

Conditioning analysis characterizes the landscape of an optimization objective by examining the spectrum of its curvature matrix. It is well explored theoretically for linear models, and we extend this analysis to deep neural networks (DNNs). To this end, we propose a layer-wise conditioning analysis that explores the optimization landscape with respect to each layer independently. This analysis is theoretically justified under mild assumptions that approximately hold in practice. Based on our analysis, we show that batch normalization (BN) can adjust the magnitude of the layer activations/gradients and thus stabilize training. However, such stabilization can create a false impression of a local minimum, which sometimes has detrimental effects on learning. In addition, we experimentally observe that BN can improve the layer-wise conditioning of the optimization problem. Finally, we observe that the last linear layer of a very deep residual network exhibits ill-conditioned behavior during training. We address this problem by adding a single BN layer before the last linear layer, which improves performance over the original residual networks, especially when the networks are deep.
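To make the proposed architectural fix concrete, the following is a minimal sketch, assuming PyTorch and torchvision's ResNet implementation (neither is prescribed by the abstract): it inserts a single BN layer immediately before the final linear (classification) layer of a residual network. Module names such as `model.fc` follow torchvision's convention and should be adapted to other architectures.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Build a standard residual network; the exact depth/variant is an assumption here.
model = resnet50(num_classes=1000)

# Wrap the classifier so that pooled features are batch-normalized
# before being passed to the original last linear layer.
model.fc = nn.Sequential(
    nn.BatchNorm1d(model.fc.in_features),  # the single added BN layer
    model.fc,                              # the original last linear layer
)

# Sanity check with a dummy batch.
x = torch.randn(8, 3, 224, 224)
print(model(x).shape)  # torch.Size([8, 1000])
```

The rest of the network is left untouched; only the input to the last linear layer is normalized, which is the lightweight modification the abstract attributes the improvement to.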
