Chaotic dynamics and convergence analysis of temporal difference algorithms with bang‐bang control

SUMMARY Reinforcement learning is a powerful tool for obtaining optimal control solutions to complex sequential decision-making problems where only minimal a priori knowledge of the system dynamics is available. As such, it has also been used as a model of cognitive learning in humans and applied to systems such as humanoid robots to study embodied cognition. In this paper, a different approach is taken: a simple test problem is used to investigate issues associated with the value function's representation and parametric convergence. In particular, the terminal convergence problem is analyzed for a known optimal control policy, where the aim is to learn the value function accurately. For certain initial conditions, the value function is calculated explicitly and shown to have a polynomial form. It is parameterized by terms that are functions of the unknown plant parameters and the value function's discount factor, and their convergence properties are analyzed. It is shown that the temporal difference error introduces a null space associated with the finite-horizon basis function during the experiment. The learning problem is non-singular only when the experiment's termination is handled correctly, and a number of (equivalent) solutions are described. Finally, it is demonstrated that, in general, the test problem's dynamics are chaotic for random initial states, and that this causes a digital offset in the value-function learning. The offset is calculated, and a dead zone is defined to switch off learning in the chaotic region. Copyright © 2015 John Wiley & Sons, Ltd.
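
To make the learning setup concrete, the sketch below illustrates TD(0) value-function learning with a polynomial basis under a fixed bang-bang policy, with updates switched off inside a dead zone near the origin where the sampled dynamics chatter. This is a minimal illustration, not the paper's implementation: the double-integrator-style plant, the quadratic stage cost, the basis, the switching rule, and all numerical constants (DT, GAMMA, ALPHA, DEAD_ZONE) are assumptions chosen only to show the structure of the updates.

```python
import numpy as np

# Minimal illustrative sketch (not the paper's implementation): TD(0) learning of a
# value function with a polynomial basis for a discretized double-integrator-style
# plant under a fixed bang-bang policy.  All models and constants are assumptions.

DT = 0.01          # sampling interval (assumed)
GAMMA = 0.99       # discount factor (assumed)
ALPHA = 0.05       # learning rate (assumed)
DEAD_ZONE = 0.05   # region near the origin where learning is switched off (assumed)

def features(x, v):
    """Polynomial basis in position x and velocity v (illustrative choice)."""
    return np.array([1.0, x, v, x * x, x * v, v * v])

def bang_bang(x, v):
    """Fixed bang-bang policy: full control on either side of a switching curve."""
    return -1.0 if x + 0.5 * v * abs(v) > 0.0 else 1.0

def step(x, v, u):
    """One Euler step of a double-integrator plant (assumed test problem)."""
    return x + DT * v, v + DT * u

def td0_episode(w, x, v, steps=2000):
    """Run one episode of TD(0) under the fixed policy, updating weights w."""
    for _ in range(steps):
        u = bang_bang(x, v)
        x_next, v_next = step(x, v, u)
        reward = -DT * (x * x + v * v)  # assumed quadratic stage cost
        # Dead zone: near the origin the sampled bang-bang dynamics chatter,
        # so TD updates are disabled there to avoid a systematic offset.
        if abs(x) > DEAD_ZONE or abs(v) > DEAD_ZONE:
            phi = features(x, v)
            td_error = reward + GAMMA * (w @ features(x_next, v_next)) - w @ phi
            w = w + ALPHA * td_error * phi
        x, v = x_next, v_next
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.zeros(6)
    for _ in range(50):
        x0, v0 = rng.uniform(-1.0, 1.0, size=2)  # random initial states
        w = td0_episode(w, x0, v0)
    print("learned weights:", w)
```

In this sketch the dead zone simply gates the weight update; the paper's analysis concerns how such an offset arises from the chaotic sampled dynamics and how the termination of each experiment must be handled to keep the learning problem non-singular.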
