Paper information - Adaptive Checkpoint Adjoint Method for Gradient Estimation in Neural ODE

Neural ordinary differential equations (NODEs) have recently attracted increasing attention; however, their empirical performance on benchmark tasks (e.g. image classification) is significantly inferior to that of discrete-layer models. We demonstrate that one explanation for their poorer performance is the inaccuracy of existing gradient estimation methods: the adjoint method has numerical errors in reverse-mode integration; the naive method directly back-propagates through ODE solvers, but suffers from a redundantly deep computation graph when searching for the optimal stepsize. We propose the Adaptive Checkpoint Adjoint (ACA) method: in automatic differentiation, ACA applies a trajectory checkpoint strategy which records the forward-mode trajectory as the reverse-mode trajectory to guarantee accuracy; ACA deletes redundant components for shallow computation graphs; and ACA supports adaptive solvers. On image classification tasks, compared with the adjoint and naive methods, ACA achieves half the error rate in half the training time; NODE trained with ACA outperforms ResNet in both accuracy and test-retest reliability. On time-series modeling, ACA outperforms competing methods. Finally, in an example of the three-body problem, we show NODE with ACA can incorporate physical knowledge to achieve better accuracy. We provide the PyTorch implementation of ACA: https://github.com/juntang-zhuang/torch-ACA.
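The checkpoint idea described in the abstract can be illustrated with a small pure-Python sketch. This is not the torch-ACA implementation: it assumes a scalar toy system dz/dt = -theta * z, a Heun-Euler embedded pair as the adaptive solver, and the loss L = z(T). All function names (`heun_step`, `forward`, `backward`, `replay`) are hypothetical. The forward pass searches for an acceptable stepsize and stores only the accepted states and stepsizes; the backward pass revisits those checkpoints in reverse order and applies the chain rule to one local step at a time, treating each stored stepsize as a constant.

```python
import math

# Toy dynamics dz/dt = f(z, theta) and its partial derivatives (assumed for illustration).
def f(z, theta):
    return -theta * z

def df_dz(z, theta):
    return -theta

def df_dtheta(z, theta):
    return -z

def heun_step(z, h, theta):
    """One Heun step; the gap to the embedded Euler step is the error estimate."""
    k1 = f(z, theta)
    k2 = f(z + h * k1, theta)
    z_next = z + 0.5 * h * (k1 + k2)
    err = abs(0.5 * h * (k2 - k1))
    return z_next, err

def forward(z0, theta, t_end, tol=1e-6, h=0.1):
    """Adaptive forward pass that stores (state, stepsize) for accepted steps only."""
    t, z = 0.0, z0
    checkpoints = []
    while t_end - t > 1e-12:
        h = min(h, t_end - t)          # do not step past t_end
        z_next, err = heun_step(z, h, theta)
        if err <= tol:
            checkpoints.append((z, h))  # accepted step: keep it
            t, z = t + h, z_next
            h *= 1.5                    # try a larger step next time
        else:
            h *= 0.5                    # rejected trial: nothing is stored
    return z, checkpoints

def backward(checkpoints, theta):
    """Reverse pass over the stored trajectory for the loss L = z(T)."""
    a = 1.0            # dL/dz at the final time
    grad_theta = 0.0
    for z, h in reversed(checkpoints):
        # Differentiate one Heun step by hand; h is treated as a constant.
        dk1_dz, dk1_dth = df_dz(z, theta), df_dtheta(z, theta)
        u = z + h * f(z, theta)
        du_dz, du_dth = 1.0 + h * dk1_dz, h * dk1_dth
        dk2_dz = df_dz(u, theta) * du_dz
        dk2_dth = df_dz(u, theta) * du_dth + df_dtheta(u, theta)
        dnext_dz = 1.0 + 0.5 * h * (dk1_dz + dk2_dz)
        dnext_dth = 0.5 * h * (dk1_dth + dk2_dth)
        grad_theta += a * dnext_dth     # uses dL/dz at the end of this step
        a *= dnext_dz                   # now dL/dz at the start of this step
    return a, grad_theta

def replay(z0, theta, checkpoints):
    """Re-run the solver with the recorded stepsizes (used for a finite-difference check)."""
    z = z0
    for _, h in checkpoints:
        z, _ = heun_step(z, h, theta)
    return z

# Usage: gradients of z(T) with respect to z0 and theta.
z_T, cps = forward(1.5, 0.7, 2.0)
grad_z0, grad_theta = backward(cps, 0.7)
print(len(cps), z_T, grad_z0, grad_theta)
```

Because the reverse pass reuses the stored forward states instead of integrating backward in time, its gradient matches the derivative of the discrete forward computation up to rounding, and rejected trial steps never enter the differentiated computation. In this sketch every accepted state is stored, so memory grows with the number of accepted steps.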
