Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces

A key element in the solution of reinforcement learning problems is the value function, which measures the long-term utility or value of any given state. The function is important because an agent can use this measure to decide what to do next. A common difficulty when reinforcement learning is applied to systems with continuous state and action spaces is that the value function must operate over a domain of real-valued variables, which means it must represent the value of infinitely many state-action pairs. For this reason, function approximators are used to represent the value function when a closed-form solution for the optimal policy is not available. In this article, we extend a previously proposed reinforcement learning algorithm so that it can be used with function approximators that generalize the value of individual experiences across both the state and action spaces. In particular, we discuss the benefits of using sparse coarse-coded function approximators to represent value functions and describe three implementations in detail: one based on cerebellar model articulation controllers (CMACs), one instance-based, and one case-based. Additionally, we discuss how function approximators with different degrees of resolution in different regions of the state and action spaces may influence the performance and learning efficiency of the agent. We propose a simple, modular technique for implementing function approximators with nonuniform resolution, so that the value function can be represented with higher accuracy in important regions of the state and action spaces. We demonstrate the proposed ideas with extensive experiments on the double-integrator and pendulum swing-up systems.
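To make the idea of a sparse coarse-coded value-function approximator concrete, the following is a minimal tile-coding sketch in the spirit of a CMAC: several overlapping, offset grids (tilings) cover a continuous state-action space, each (state, action) pair activates exactly one tile per tiling, and Q(s, a) is the sum of the active tiles' weights. All names, tiling counts, ranges, and the learning rate are illustrative assumptions, not the article's actual implementation.

```python
class TileCoder:
    """Hypothetical sparse coarse coding (tile coding) over a 1-D state
    and 1-D action, in the spirit of a CMAC. Parameters are illustrative."""

    def __init__(self, n_tilings=8, n_tiles=10, lo=(-1.0, -1.0), hi=(1.0, 1.0)):
        self.n_tilings = n_tilings
        self.n_tiles = n_tiles
        self.lo, self.hi = lo, hi
        # One weight grid per tiling; each tiling is an n_tiles x n_tiles grid.
        self.w = [[[0.0] * n_tiles for _ in range(n_tiles)]
                  for _ in range(n_tilings)]

    def _active_tiles(self, s, a):
        """Return the (tiling, row, col) index of the one active tile per tiling."""
        tiles = []
        for t in range(self.n_tilings):
            offset = t / self.n_tilings  # shift each tiling by a fraction of a tile
            idx = []
            for x, lo, hi in ((s, self.lo[0], self.hi[0]),
                              (a, self.lo[1], self.hi[1])):
                frac = (x - lo) / (hi - lo) * self.n_tiles + offset
                idx.append(min(max(int(frac), 0), self.n_tiles - 1))
            tiles.append((t, idx[0], idx[1]))
        return tiles

    def value(self, s, a):
        """Q(s, a): the sum of the weights of the active tiles."""
        return sum(self.w[t][i][j] for t, i, j in self._active_tiles(s, a))

    def update(self, s, a, target, alpha=0.1):
        """Move Q(s, a) toward a TD target; the error is split across tilings."""
        step = alpha * (target - self.value(s, a)) / self.n_tilings
        for t, i, j in self._active_tiles(s, a):
            self.w[t][i][j] += step
```

Because nearby state-action pairs fall into many of the same tiles, an update at one point also raises the estimated value of its neighbors (generalization), while distant pairs share no tiles and are unaffected (sparseness). Nonuniform resolution, as discussed above, can be obtained by composing tilers of different tile widths over different regions.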
