Second-Order Neural ODE Optimizer

We propose a novel second-order optimization framework for training emerging deep continuous-time models, specifically Neural Ordinary Differential Equations (Neural ODEs). Since their training already involves expensive gradient computation by solving a backward ODE, deriving efficient second-order methods is highly nontrivial. Nevertheless, inspired by the recent Optimal Control (OC) interpretation of training deep networks, we show that a specific continuous-time OC methodology, called Differential Programming, can be adopted to derive backward ODEs for higher-order derivatives at the same O(1) memory cost. We further explore a low-rank representation of the second-order derivatives and show that it leads to efficient preconditioned updates with the aid of Kronecker-based factorization. The resulting method, named SNOpt, converges much faster than first-order baselines in wall-clock time, and the improvement remains consistent across various applications, e.g., image classification, generative flow, and time-series prediction. Our framework also enables direct architecture optimization, such as the integration time of Neural ODEs, with second-order feedback policies, strengthening the OC perspective as a principled tool for analyzing optimization in deep learning. Our code is available at https://github.com/ghliu/snopt.
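To make the second ingredient of the abstract concrete, below is a minimal PyTorch sketch of a Kronecker-factored preconditioned update applied to the weight of a Neural ODE vector field. This is not the authors' SNOpt implementation: the ODE is integrated with a naive explicit Euler loop, the gradient comes from ordinary backpropagation rather than the paper's O(1)-memory backward ODEs, and the empirical (unsampled) Fisher is used; only the layer-wise approximation of the curvature as a Kronecker product A ⊗ G and the resulting update G^{-1} dW A^{-1} mirror the described idea. All names (OdeBlock, kfac_preconditioned_step) and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OdeBlock(nn.Module):
    """dz/dt = tanh(W z + b), integrated with a few explicit Euler steps for illustration."""
    def __init__(self, dim, steps=10, horizon=1.0):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.steps, self.dt = steps, horizon / steps

    def forward(self, z):
        self.layer_inputs, self.pre_acts = [], []     # statistics for the Kronecker factors
        for _ in range(self.steps):
            s = self.linear(z)
            s.retain_grad()                            # keep dL/ds for the G factor
            self.layer_inputs.append(z.detach())
            self.pre_acts.append(s)
            z = z + self.dt * torch.tanh(s)
        return z

def kfac_preconditioned_step(block, lr=0.05, damping=1e-1):
    """Approximate the layer curvature as A ⊗ G and apply (A ⊗ G)^{-1} vec(dW),
    which in matrix form is G^{-1} dW A^{-1}.  Correlations across integration
    steps are ignored, and scaling constants / damping are chosen loosely for
    this toy problem rather than tuned as in a real K-FAC implementation."""
    a = torch.cat(block.layer_inputs, dim=0)                  # (steps*batch, in)
    g = torch.cat([s.grad for s in block.pre_acts], dim=0)    # (steps*batch, out)
    A = a.t() @ a / a.shape[0] + damping * torch.eye(a.shape[1])
    G = g.t() @ g / g.shape[0] + damping * torch.eye(g.shape[1])
    with torch.no_grad():
        W, b = block.linear.weight, block.linear.bias
        W -= lr * torch.linalg.solve(G, W.grad) @ torch.linalg.inv(A)
        b -= lr * b.grad                                      # bias left unpreconditioned

# --- toy usage: preconditioned training steps on a small regression problem ---
torch.manual_seed(0)
model, x, target = OdeBlock(dim=4), torch.randn(64, 4), torch.randn(64, 4)
for it in range(100):
    model.zero_grad()
    loss = ((model(x) - target) ** 2).mean()
    loss.backward()
    kfac_preconditioned_step(model)
    if it % 20 == 0:
        print(f"iter {it:3d}  loss {loss.item():.4f}")
```

Because the factors A and G have the dimensions of a single layer's inputs and outputs, inverting them is far cheaper than inverting the full curvature matrix, which is what makes this style of preconditioning practical for the low-rank second-order information the abstract refers to.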
