Layer-Wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs

Conditioning analysis uncovers the landscape of an optimization objective by exploring the spectrum of its curvature matrix. This has been well explored theoretically for linear models. We extend this analysis to deep neural networks (DNNs) in order to investigate their learning dynamics. To this end, we propose layer-wise conditioning analysis, which explores the optimization landscape with respect to each layer independently. Such an analysis is theoretically supported under mild assumptions that approximately hold in practice. Based on our analysis, we show that batch normalization (BN) can stabilize training, but sometimes results in a false impression of a local minimum, which has detrimental effects on learning. We further observe experimentally that BN can improve the layer-wise conditioning of the optimization problem. Finally, we find that the last linear layer of a very deep residual network displays ill-conditioned behavior. We solve this problem by adding a single BN layer before the last linear layer, which yields improved performance over the original and pre-activation residual networks.
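The sketch below illustrates one simple way to probe layer-wise conditioning, under the common Kronecker-factorization assumption that a linear layer's curvature block factorizes into the (uncentered) second-moment matrix of its inputs and the covariance of its output gradients; here only the input-side factor is examined as a proxy. This is a minimal illustration, not the authors' code, and the helper name `layer_condition_numbers` is hypothetical.

```python
# Simplified sketch of layer-wise conditioning analysis (illustration only):
# for each nn.Linear layer, estimate the condition number of the
# uncentered second-moment matrix of its inputs, which under the
# Kronecker-factored approximation indicates how well-conditioned
# that layer's sub-problem is.
import torch
import torch.nn as nn

def layer_condition_numbers(model, x, eps=1e-12):
    """Return {layer_name: cond(Sigma_x)} for every nn.Linear layer."""
    stats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            h = inputs[0].detach().reshape(-1, inputs[0].shape[-1])
            sigma_x = h.t() @ h / h.shape[0]        # input second-moment matrix
            eig = torch.linalg.eigvalsh(sigma_x)    # eigenvalues, ascending
            stats[name] = (eig[-1] / eig[0].clamp_min(eps)).item()
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if isinstance(m, nn.Linear)]
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return stats

# Toy usage: a small MLP and a random batch of inputs.
mlp = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
print(layer_condition_numbers(mlp, torch.randn(256, 32)))
```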

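The fix described in the last sentence of the abstract, inserting one BN layer immediately before the final linear classifier, can be sketched as follows. This is an assumed placement using torchvision's `resnet50` and its `fc` attribute as a stand-in; it is not the authors' implementation.

```python
# Minimal PyTorch sketch: add a single BatchNorm1d layer before the
# last linear layer of a residual network, as described in the abstract.
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(num_classes=1000)

# torchvision exposes the final classifier as `model.fc`; wrap it so the
# pooled features are batch-normalized before the linear layer.
in_features = model.fc.in_features
model.fc = nn.Sequential(
    nn.BatchNorm1d(in_features),  # the single extra BN layer
    model.fc,                     # the original last linear layer
)

# Quick shape check on a random batch.
x = torch.randn(8, 3, 224, 224)
print(model(x).shape)  # torch.Size([8, 1000])
```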