The Importance of Clipping in Neurocontrol by Direct Gradient Descent on the Cost-to-Go Function and in Adaptive Dynamic Programming

In adaptive dynamic programming, neurocontrol and reinforcement learning, the objective is for an agent to learn to choose actions so as to minimise a total cost function. In this paper we show that when discretized time is used to model the motion of the agent, it can be very important to "clip" the motion of the agent in the final time step of the trajectory. By clipping we mean that the final time step of the trajectory is truncated so that the agent stops exactly at the first terminal state reached, and no distance further. We demonstrate that when clipping is omitted, learning performance can fail to reach the optimum, and that when clipping is done properly, learning performance can improve significantly. The clipping problem we describe affects algorithms which use explicit derivatives of the model functions of the environment to calculate a learning gradient. These include Backpropagation Through Time for Control and methods based on Dual Heuristic Dynamic Programming. However, the clipping problem does not significantly affect methods based on Heuristic Dynamic Programming, Temporal Differences or Policy Gradient Learning algorithms. Similarly, the clipping problem does not affect fixed-length finite-horizon problems.
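To make the idea of clipping concrete, the sketch below shows a single clipped time step for a one-dimensional, Euler-integrated environment with a terminal boundary at x_goal. The 1-D setting, the Euler scheme and all names (clipped_euler_step, f, x_goal) are illustrative assumptions for this sketch, not definitions taken from the paper.

```python
import numpy as np

def clipped_euler_step(x, u, dt, f, x_goal):
    """One Euler step of x' = f(x, u), truncated ("clipped") so that the
    trajectory stops exactly at the terminal boundary x >= x_goal instead of
    overshooting it in the final time step.

    Returns the next state, the step duration actually used, and a flag
    indicating whether a terminal state was reached.
    """
    dx = f(x, u)                        # state derivative at the current state
    x_next = x + dt * dx                # full, un-clipped Euler step

    if dx > 0 and x_next >= x_goal:     # the full step would cross the boundary
        lam = (x_goal - x) / (dt * dx)  # fraction of the step that lands exactly on it
        lam = float(np.clip(lam, 0.0, 1.0))
        return x_goal, lam * dt, True   # stop at the first terminal state reached
    return x_next, dt, False
```

The shortened duration lam * dt is the one that should be used when accumulating the final-step cost. Because lam depends on the state and action, derivative-based methods such as Backpropagation Through Time and Dual Heuristic Dynamic Programming must propagate gradients through this truncation, which is why, as the abstract notes, omitting clipping distorts their learning gradient while value-based methods such as Heuristic Dynamic Programming or Temporal Differences are largely unaffected.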
