Penalty and Augmented Lagrangian Methods for Layer-parallel Training of Residual Networks

Algorithms for training residual networks (ResNets) typically require a forward pass of the data followed by backpropagation of the loss gradient to perform parameter updates, which can take many hours or even days for networks with hundreds of layers. Inspired by penalty and augmented Lagrangian methods, this work proposes a layer-parallel training algorithm that overcomes the scalability barrier caused by the serial nature of forward-backward propagation in deep residual learning. Moreover, by viewing the supervised classification task as a numerical discretization of a terminal control problem, we bridge the concept of synthetic gradients for decoupling backpropagation with the parareal method for solving differential equations, which not only offers a novel perspective on the design of synthetic loss functions but also allows parameter updates with reduced storage overhead. Experiments on a preliminary example demonstrate that the proposed algorithm achieves testing accuracy comparable to, or even better than, full serial backpropagation, while the layer-parallelism it enables provides a speedup over traditional layer-serial training.
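To make the setting concrete, the training task can be written as a discrete optimal control problem over the layer dynamics of the ResNet. The following is a minimal sketch of the penalty and augmented Lagrangian reformulations that decouple the layers, assuming a residual map $x_{l+1} = x_l + h\,F(x_l,\theta_l)$ with step size $h$ and a terminal loss $\Phi$; the particular block partitioning, penalty schedule, and synthetic loss construction used in the paper are not reproduced here.

\[
\min_{\theta,\,\{x_l\}}\ \Phi(x_L)
\quad\text{s.t.}\quad x_{l+1} = x_l + h\,F(x_l,\theta_l),\quad l = 0,\dots,L-1,\quad x_0 = \text{input data}.
\]

A penalty reformulation relaxes the layer constraints and treats the hidden states as free variables:

\[
\min_{\theta,\,\{x_l\}}\ \Phi(x_L) + \frac{\rho}{2}\sum_{l=0}^{L-1}\bigl\| x_{l+1} - x_l - h\,F(x_l,\theta_l)\bigr\|^2 ,
\]

while the augmented Lagrangian additionally introduces multipliers $\lambda_l$, updated by $\lambda_l \leftarrow \lambda_l + \rho\,\bigl(x_{l+1} - x_l - h\,F(x_l,\theta_l)\bigr)$:

\[
\mathcal{L}_\rho(\theta, x, \lambda) = \Phi(x_L)
+ \sum_{l=0}^{L-1}\Bigl[\bigl\langle \lambda_l,\ x_{l+1} - x_l - h\,F(x_l,\theta_l)\bigr\rangle
+ \frac{\rho}{2}\bigl\| x_{l+1} - x_l - h\,F(x_l,\theta_l)\bigr\|^2\Bigr].
\]

Because each penalty term couples only adjacent states, the minimization over $(x_l,\theta_l)$ splits into per-layer (or per-block) subproblems that can be solved concurrently, which is the source of the layer-parallelism described above.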
