Multilevel Initialization for Layer-Parallel Deep Neural Network Training

This paper investigates multilevel initialization strategies for training very deep neural networks with a layer-parallel multigrid solver. The scheme is based on the continuous interpretation of the training problem as an optimal control problem, in which neural networks are represented as discretizations of time-dependent ordinary differential equations. A key goal is to develop a method that intelligently initializes the network parameters for the very deep networks enabled by scalable layer-parallel training. To do this, we apply a refinement strategy across the time domain that is equivalent to refining in the layer dimension. The resulting refinements create deep networks whose parameters are well initialized from the coarser trained networks. We investigate the effectiveness of such multilevel "nested iteration" strategies for network training, presenting numerical evidence of reduced run time for equivalent accuracy. In addition, we study whether the initialization strategies provide a regularizing effect on the overall training process and reduce sensitivity to hyperparameters and randomness in initial network parameters.
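To make the nested-iteration idea concrete, the sketch below trains a residual network viewed as a forward-Euler discretization of an ODE, then doubles the number of layers and copies each coarse layer's parameters onto the two fine layers covering the same time interval (piecewise-constant prolongation in time) before retraining. This is a minimal sketch under assumed details: the model class OdeResNet, the helpers refine_in_time and train, and all hyperparameters are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch of multilevel "nested iteration" initialization for a
# residual network viewed as a forward-Euler discretization of an ODE.
# All names (OdeResNet, refine_in_time, train) and hyperparameters are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class OdeResNet(nn.Module):
    """Residual net x_{k+1} = x_k + h * tanh(W_k x_k + b_k), with h = T / n_layers."""
    def __init__(self, width, n_layers, T=1.0):
        super().__init__()
        self.T = T
        self.h = T / n_layers
        self.layers = nn.ModuleList(nn.Linear(width, width) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = x + self.h * torch.tanh(layer(x))
        return x

def refine_in_time(coarse):
    """Double the layer count; each coarse layer initializes the two fine
    layers that cover its time interval (piecewise-constant prolongation)."""
    width = coarse.layers[0].in_features
    fine = OdeResNet(width, 2 * len(coarse.layers), coarse.T)
    for k, layer in enumerate(coarse.layers):
        fine.layers[2 * k].load_state_dict(layer.state_dict())
        fine.layers[2 * k + 1].load_state_dict(layer.state_dict())
    return fine

def train(model, data, targets, steps=200, lr=1e-2):
    """Plain SGD on a regression loss; stands in for any training loop."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(data), targets).backward()
        opt.step()
    return model

torch.manual_seed(0)
x, y = torch.randn(64, 8), torch.randn(64, 8)

model = train(OdeResNet(width=8, n_layers=4), x, y)  # train the coarsest network
for level in range(2):                               # nested iteration: refine, retrain
    model = refine_in_time(model)
    model = train(model, x, y)
```

Because the refined network discretizes the same underlying ODE with a halved time step, each training phase starts near the previous level's solution rather than from randomly initialized weights, which is the source of the run-time and regularization effects studied in the paper.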
