Maximum Principle Based Algorithms for Deep Learning

The continuous dynamical systems approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem, which allows us to formulate necessary optimality conditions in continuous time using Pontryagin's maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it achieves a favorable per-iteration convergence rate in the early stages of training, provided the Hamiltonian maximization can be carried out efficiently; this maximization step is still in need of improvement. Overall, the approach opens up new avenues for attacking problems associated with deep learning, such as trapping in slow manifolds and the inapplicability of gradient-based methods to discrete trainable variables.
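
To make the procedure concrete, the following is a minimal sketch of the basic method of successive approximations (MSA) for a discrete-time control formulation of training. The notation is illustrative: $x_t$ denotes the layer activations (state), $\theta_t$ the trainable parameters of layer $t$ (control), $f_t$ the layer transformation, $L_t$ a running cost (e.g. regularization), and $\Phi$ the terminal loss; these symbols are assumed here for exposition and are not the paper's exact definitions.

\begin{align*}
  &\text{Dynamics:}    && x_{t+1} = f_t(x_t, \theta_t), \qquad t = 0, \dots, T-1, \\
  &\text{Objective:}   && \min_{\theta}\; \Phi(x_T) + \sum_{t=0}^{T-1} L_t(x_t, \theta_t), \\
  &\text{Hamiltonian:} && H_t(x, p, \theta) = p \cdot f_t(x, \theta) - L_t(x, \theta).
\end{align*}

Given the current controls $\theta^k$, one MSA iteration alternates a forward pass, a backward (costate) pass, and a layer-wise Hamiltonian maximization:

\begin{align*}
  &\text{Forward:}  && x^k_{t+1} = f_t(x^k_t, \theta^k_t), \qquad x^k_0 = x_0, \\
  &\text{Backward:} && p^k_t = \nabla_x H_t(x^k_t, p^k_{t+1}, \theta^k_t), \qquad p^k_T = -\nabla_x \Phi(x^k_T), \\
  &\text{Update:}   && \theta^{k+1}_t = \arg\max_{\theta} H_t(x^k_t, p^k_{t+1}, \theta).
\end{align*}

The maximization step plays the role that the gradient update plays in backpropagation-based training; since it does not require gradients with respect to $\theta$, it can in principle accommodate discrete trainable variables, at the cost of making the maximization itself the computational bottleneck noted above.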
