Hamiltonian Deep Neural Networks Guaranteeing Non-vanishing Gradients by Design

Training Deep Neural Networks (DNNs) can be difficult due to vanishing and exploding gradients during weight optimization through backpropagation. To address this problem, we propose a general class of Hamiltonian DNNs (H-DNNs) that stem from the discretization of continuous-time Hamiltonian systems and include several existing architectures based on ordinary differential equations. Our main result is that a broad set of H-DNNs ensures non-vanishing gradients by design for an arbitrary network depth. This is obtained by proving that, under a semi-implicit Euler discretization scheme, the backward sensitivity matrices involved in gradient computations are symplectic. We also provide an upper bound on the magnitude of the sensitivity matrices, and show that exploding gradients can be either controlled through regularization or avoided for special architectures. Finally, we enable distributed implementations of backward and forward propagation algorithms in H-DNNs by characterizing appropriate sparsity constraints on the weight matrices. The good performance of H-DNNs is demonstrated on benchmark classification problems, including image classification with the MNIST dataset.
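
To make the construction concrete, below is a minimal sketch of forward propagation through one family of Hamiltonian layers, assuming a separable Hamiltonian with a split state (p, q) and a Verlet-type semi-implicit Euler update of the kind used in related ODE-based architectures. The function and parameter names (`hdnn_forward`, `Kp`, `Kq`, `bp`, `bq`, step size `h`) are illustrative assumptions, not the paper's implementation; the point is only that each layer updates q using the already-updated p, which is what makes the layer map symplectic.

```python
import numpy as np

def sigma(x):
    # Smooth, bounded activation; tanh is a common choice for Hamiltonian-inspired nets.
    return np.tanh(x)

def hdnn_forward(p0, q0, Kp, Kq, bp, bq, h=0.1):
    """Forward pass through an H-DNN-style network (illustrative sketch).

    Each layer j applies a semi-implicit (symplectic) Euler step of a
    separable Hamiltonian system:
        p_{j+1} = p_j - h * Kq[j].T @ sigma(Kq[j] @ q_j     + bq[j])
        q_{j+1} = q_j + h * Kp[j].T @ sigma(Kp[j] @ p_{j+1} + bp[j])
    Note that the q-update uses the already-updated p_{j+1}.
    """
    p, q = p0.copy(), q0.copy()
    for j in range(len(Kp)):
        p = p - h * Kq[j].T @ sigma(Kq[j] @ q + bq[j])
        q = q + h * Kp[j].T @ sigma(Kp[j] @ p + bp[j])
    return p, q

# Toy usage: 8 layers, 4-dimensional p and q states, random weights.
rng = np.random.default_rng(0)
L, n = 8, 4
Kp = [rng.standard_normal((n, n)) for _ in range(L)]
Kq = [rng.standard_normal((n, n)) for _ in range(L)]
bp = [rng.standard_normal(n) for _ in range(L)]
bq = [rng.standard_normal(n) for _ in range(L)]
p_out, q_out = hdnn_forward(rng.standard_normal(n), rng.standard_normal(n),
                            Kp, Kq, bp, bq)
print(p_out, q_out)
```

Because each layer is a symplectic map of the (p, q) state, the product of backward sensitivity matrices cannot contract to zero, which is the mechanism behind the non-vanishing-gradient guarantee; the depth L above can be increased freely without changing this property.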
