Symplectic Adjoint Method for Exact Gradient of Neural ODE with Minimal Memory

A neural network model of a differential equation, namely a neural ODE, has enabled us to learn continuous-time dynamical systems and probability distributions with high accuracy. It evaluates the same network repeatedly during numerical integration. Hence, the backpropagation algorithm requires a memory footprint proportional to the number of evaluations times the network size. This holds even if a checkpointing scheme divides the computational graph into sub-graphs. In contrast, the adjoint method obtains a gradient by numerical integration backward in time with a minimal memory footprint; however, it suffers from numerical errors. This study proposes the symplectic adjoint method, which obtains the exact gradient (up to rounding error) with a memory footprint proportional to the number of evaluations plus the network size. The experimental results demonstrate that, among competitive methods, the symplectic adjoint method occupies the smallest memory footprint in most cases, runs faster in some cases, and is robust to rounding errors.
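To make the trade-off concrete, below is a minimal NumPy sketch of the vanilla adjoint method described above, not the proposed symplectic integrator. The forward pass stores only the final state; the backward pass re-integrates the state together with the adjoint variable and accumulates the parameter gradient. The toy vector field tanh(theta @ z), the explicit Euler steps, and all names are illustrative assumptions; the discretization error introduced by re-integrating backward is the kind of numerical error the abstract refers to.

```python
import numpy as np

def f(z, theta):
    """Toy vector field dz/dt = tanh(theta @ z) (illustrative assumption)."""
    return np.tanh(theta @ z)

def adjoint_gradient(z0, theta, t0, t1, n_steps, dLdz1):
    """Gradient of a terminal loss w.r.t. theta via the vanilla adjoint method.

    The forward pass keeps only the final state; the backward pass
    re-integrates z together with the adjoint a and the running gradient.
    """
    h = (t1 - t0) / n_steps

    # Forward integration (explicit Euler), storing nothing but the final state.
    z = z0.copy()
    for _ in range(n_steps):
        z = z + h * f(z, theta)

    # Backward integration of the augmented system (z, a, dL/dtheta).
    a = dLdz1.copy()              # a(T) = dL/dz(T)
    grad = np.zeros_like(theta)   # accumulates dL/dtheta
    for _ in range(n_steps):
        s = 1.0 - np.tanh(theta @ z) ** 2        # sech^2(theta @ z) at the current z
        Jz = s[:, None] * theta                  # df/dz = diag(s) @ theta
        grad = grad + h * np.outer(a * s, z)     # quadrature of a^T (df/dtheta)
        a = a + h * (Jz.T @ a)                   # backward-in-time step of da/dt = -(df/dz)^T a
        z = z - h * f(z, theta)                  # reconstruct z backward in time
    return grad

# Example usage (illustrative shapes only): terminal loss L = sum(z(1)), so dL/dz(1) = 1.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 3))
z0 = rng.normal(size=3)
g = adjoint_gradient(z0, theta, t0=0.0, t1=1.0, n_steps=200, dLdz1=np.ones(3))
print(g.shape)  # (3, 3)
```

Because the backward pass re-integrates z with its own discretization, the reconstructed trajectory does not exactly match the forward one; the symplectic adjoint method instead pairs the forward solver with a symplectic Runge-Kutta scheme for the adjoint system so that the computed gradient is exact up to rounding error.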
