Adaptive Checkpoint Adjoint Method for Gradient Estimation in Neural ODE

Neural ordinary differential equations (NODEs) have recently attracted increasing attention; however, their empirical performance on benchmark tasks (e.g. image classification) is significantly inferior to that of discrete-layer models. We demonstrate that one explanation for their poorer performance is the inaccuracy of existing gradient estimation methods: the adjoint method has numerical errors in reverse-mode integration; the naive method back-propagates directly through ODE solvers, but suffers from a redundantly deep computation graph when searching for the optimal stepsize. We propose the Adaptive Checkpoint Adjoint (ACA) method: in automatic differentiation, ACA applies a trajectory checkpoint strategy which records the forward-mode trajectory and reuses it as the reverse-mode trajectory to guarantee accuracy; ACA deletes redundant components to keep the computation graph shallow; and ACA supports adaptive solvers. On image classification tasks, compared with the adjoint and naive methods, ACA achieves half the error rate in half the training time; a NODE trained with ACA outperforms ResNet in both accuracy and test-retest reliability. On time-series modeling, ACA outperforms competing methods. Finally, in a three-body problem example, we show that a NODE with ACA can incorporate physical knowledge to achieve better accuracy. We provide the PyTorch implementation of ACA: https://github.com/juntang-zhuang/torch-ACA.
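The checkpoint idea described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the torch-ACA implementation: the scalar test ODE dz/dt = theta * z, the Heun-Euler embedded pair, the halve/double step-size policy, and all function names are assumptions chosen for illustration. The forward pass searches for an acceptable step size without recording anything for the rejected trials, and stores only the accepted (state, step size) pairs. The backward pass then walks those checkpoints in reverse and differentiates one local step at a time, treating each accepted step size as a constant.

```python
import math

# Hypothetical scalar dynamics dz/dt = f(z, theta) = theta * z,
# with hand-written partial derivatives (no autodiff library needed).
def f(z, theta):
    return theta * z

def df_dz(z, theta):
    return theta

def df_dtheta(z, theta):
    return z

def heun_step(z, h, theta):
    """One Heun step plus an error estimate from the embedded Euler step."""
    k1 = f(z, theta)
    k2 = f(z + h * k1, theta)
    z_new = z + 0.5 * h * (k1 + k2)
    err = abs(z_new - (z + h * k1))
    return z_new, err

def forward(z0, theta, t0, t1, h=0.5, tol=1e-6):
    """Adaptive forward pass; records only accepted steps as checkpoints."""
    t, z = t0, z0
    checkpoints = []
    while t1 - t > 1e-12:
        h = min(h, t1 - t)
        z_new, err = heun_step(z, h, theta)
        while err > tol:  # step-size search: rejected trials leave no record
            h *= 0.5
            z_new, err = heun_step(z, h, theta)
        checkpoints.append((z, h))  # state before the step, accepted step size
        t, z = t + h, z_new
        h *= 2.0  # try a larger step next time
    return z, checkpoints

def backward(checkpoints, theta, grad_out=1.0):
    """Reverse pass over the recorded trajectory, one local step at a time."""
    a = grad_out  # dL/dz at the end of the current step
    grad_theta = 0.0
    for z, h in reversed(checkpoints):
        # Recompute the local forward quantities from the checkpoint.
        k1 = f(z, theta)
        u = z + h * k1
        # Chain rule through z_new = z + h/2 * (k1 + k2), h held constant.
        d_k2 = 0.5 * h * a
        d_u = d_k2 * df_dz(u, theta)
        grad_theta += d_k2 * df_dtheta(u, theta)
        d_k1 = 0.5 * h * a + h * d_u
        grad_theta += d_k1 * df_dtheta(z, theta)
        a = a + d_u + d_k1 * df_dz(z, theta)
    return a, grad_theta

z1, cps = forward(1.0, -0.7, 0.0, 1.0)
grad_z0, grad_theta = backward(cps, -0.7)
print(len(cps), z1, grad_z0, grad_theta)
```

In this sketch the reverse pass visits exactly the states computed in the forward pass, so the gradient is the exact derivative of the discrete forward solution for the recorded step sizes, and the depth of each local computation is one solver step regardless of how many trial steps were rejected. Memory grows with the number of accepted steps, since one state and one step size are stored per step.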
