Simple and Fast Calculation of the Second-Order Gradients for Globalized Dual Heuristic Dynamic Programming in Neural Networks

We derive an algorithm to exactly calculate the mixed second-order derivatives of a neural network's output with respect to its input vector and its weight vector. This calculation is required by the adaptive dynamic programming (ADP) algorithms globalized dual heuristic programming (GDHP) and value-gradient learning. The algorithm computes the product of this second-order matrix with a given fixed vector in time linear in the number of weights in the network. We use a “forward accumulation” of the derivative calculations, which yields a much more elegant and easier-to-implement solution than has previously been published for this task. In doing so, the algorithm makes GDHP simple to implement and efficient to run, bridging the gap between the widely used DHP method and GDHP.
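As a concrete reference point, the same matrix-vector product can be sketched with a modern automatic-differentiation library. The listing below is a minimal JAX illustration, not the paper's own forward-accumulation recursion: it computes d/dw [ (df/dx) . v ] by composing a forward-mode JVP over the input with a reverse-mode gradient over the weights, which gives the mixed second-order product in time linear in the number of weights. The network mlp and the helper mixed_second_order_product are hypothetical names chosen for the example.

    import jax
    import jax.numpy as jnp

    def mlp(w, x):
        # Hypothetical two-layer tanh network with a scalar output,
        # standing in for a critic network.  w is a dict of parameter
        # arrays; x is the input vector.
        h = jnp.tanh(w["W1"] @ x + w["b1"])
        return jnp.dot(w["w2"], h) + w["b2"]

    def mixed_second_order_product(f, w, x, v):
        # Returns d/dw [ (df/dx) . v ]: the product of the mixed
        # second-derivative matrix d^2 f / (dw dx) with the fixed
        # vector v, without ever forming the matrix itself.
        def directional_input_derivative(w_):
            # Forward-mode pass: the scalar (df/dx) . v.
            _, s = jax.jvp(lambda x_: f(w_, x_), (x,), (v,))
            return s
        # Reverse-mode pass over the weights; cost stays linear in |w|.
        return jax.grad(directional_input_derivative)(w)

    key = jax.random.PRNGKey(0)
    k1, k2 = jax.random.split(key)
    n_in, n_hidden = 4, 8
    w = {
        "W1": jax.random.normal(k1, (n_hidden, n_in)),
        "b1": jnp.zeros(n_hidden),
        "w2": jax.random.normal(k2, (n_hidden,)),
        "b2": jnp.zeros(()),
    }
    x = jnp.ones(n_in)
    v = jnp.linspace(1.0, 2.0, n_in)   # the given fixed vector

    result = mixed_second_order_product(mlp, w, x, v)
    # result has the same pytree structure as w: one entry per weight array.

The composition order matters: taking the directional derivative over the input first keeps every intermediate quantity the size of a single forward pass, which is the same property that makes the paper's forward-accumulation derivation run in linear time.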
