Predict Globally, Correct Locally: Parallel-in-Time Optimal Control of Neural Networks

The links between optimal control of dynamical systems and neural networks have proved beneficial both from a theoretical and from a practical point of view. Several researchers have exploited these links to investigate the stability of different neural network architectures and to develop memory-efficient training algorithms. We also adopt the dynamical systems view of neural networks, but our aim differs from that of earlier works. We exploit the links between dynamical systems, optimal control, and neural networks to develop a novel distributed optimization algorithm. The proposed algorithm addresses the most significant obstacle to distributed algorithms for neural network optimization: the network weights cannot be updated until the forward propagation of the data and the backward propagation of the gradients are complete. Using the dynamical systems point of view, we interpret the layers of a (residual) neural network as the discretized dynamics of a dynamical system and exploit the relationship between the co-states (adjoints) of the optimal control problem and backpropagation. We then develop a parallel-in-time method that updates the parameters of the network without waiting for the forward or backward propagation to complete in full. We establish the convergence of the proposed algorithm. Preliminary numerical results suggest that the algorithm is competitive with, and more efficient than, the state of the art.
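To make the connection the abstract alludes to concrete, the following is a minimal sketch of the standard dynamical-systems reading of a residual network, assuming a forward-Euler discretization with step size h; the notation is illustrative and not necessarily the authors' exact formulation. A residual block is one step of the discretized dynamics, the co-state (adjoint) recursion of the associated optimal control problem coincides with backpropagation, and the per-layer gradient follows from the co-state:

    x_{k+1} = x_k + h\, f(x_k, \theta_k), \qquad k = 0, \dots, K-1                      % forward pass = discretized dynamics
    \lambda_K = \nabla_x \Phi(x_K)                                                       % terminal co-state = gradient of the loss \Phi
    \lambda_k = \big( I + h\, \partial_x f(x_k, \theta_k) \big)^{\top} \lambda_{k+1}     % co-state recursion = backpropagation
    \nabla_{\theta_k} J = h\, \partial_{\theta} f(x_k, \theta_k)^{\top} \lambda_{k+1}    % per-layer weight gradient

Because the layer index k plays the role of time, a parallel-in-time scheme can split the layers into blocks that propagate states and co-states concurrently from predicted interface values and then correct those predictions iteratively, which is what allows the weights to be updated before the full forward and backward sweeps finish.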
