Integrated Learning and Planning Based on Truncating Temporal Differences

Reinforcement learning systems learn to act in an uncertain environment by executing actions and observing their long-term effects. A large number of time steps may be required before this trial-and-error process converges to a satisfactory policy. It is highly desirable to minimize the number of experiences the system needs to learn its task, particularly when errors are costly. One approach to this goal is to use hypothetical experiences, which require some additional computation but may reduce the number of much more costly real experiences needed. This well-known idea of augmenting reinforcement learning with planning is revisited in this paper in the context of truncated TD(λ), or TTD, a simple computational technique that allows reinforcement learning algorithms based on temporal-difference methods to learn considerably faster at essentially no additional computational expense. Two ways of combining TTD with planning are proposed, making it possible to benefit from λ > 0 in both the learning and planning processes. The algorithms are evaluated experimentally on a family of grid path-finding tasks and shown to yield a considerable reduction in the number of real interactions with the environment required for convergence, as well as improved scaling properties.
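
As a rough illustration of the ingredients involved, the sketch below (a minimal sketch, not the paper's exact algorithms) combines tabular Q-learning whose real-experience updates use a truncated TD(λ) return computed over a sliding window of the last M transitions (the TTD idea) with Dyna-style planning updates drawn from a learned deterministic model. The environment interface, all hyperparameter values, and the use of plain one-step backups for the hypothetical experiences are illustrative assumptions; the paper's own combinations are precisely about carrying λ > 0 into the planning backups as well, for instance by pushing short model-generated trajectory segments through the same truncated-return computation.

import random
from collections import defaultdict

# Minimal sketch (not the paper's exact algorithms), assuming a tabular
# grid-world-style task: Q-learning whose real-experience updates use a
# truncated TD(lambda) return over a window of the last M transitions (TTD),
# combined with Dyna-style one-step planning updates from a learned
# deterministic model. Environment interface and constants are assumptions.

GAMMA, LAMBDA, ALPHA, EPSILON = 0.95, 0.8, 0.1, 0.1
M = 8               # TTD truncation horizon
PLANNING_STEPS = 5  # hypothetical (model-based) updates per real step

Q = defaultdict(float)   # Q[(state, action)] -> action-value estimate
model = {}               # model[(state, action)] = (reward, next_state, next_actions, done)


def greedy_value(state, actions):
    return max(Q[(state, a)] for a in actions)


def choose_action(state, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def ttd_return(window):
    """Backward pass computing the truncated lambda-return for the oldest
    transition in `window`, a list of (s, a, r, s_next, next_actions, done)
    tuples ordered oldest first."""
    s_T, acts_T, done_T = window[-1][3], window[-1][4], window[-1][5]
    z = 0.0 if done_T else greedy_value(s_T, acts_T)   # value beyond the window
    for (_, _, r, s_next, acts, d) in reversed(window):
        v = 0.0 if d else greedy_value(s_next, acts)
        z = r + GAMMA * ((1.0 - LAMBDA) * v + LAMBDA * z)
    return z


def learn_episode(env, max_steps=500):
    """One episode of real interaction: TTD updates on real experience plus
    Dyna-style one-step updates on hypothetical experience."""
    state, actions = env.reset()            # assumed interface
    window = []
    for _ in range(max_steps):
        a = choose_action(state, actions)
        reward, next_state, next_actions, done = env.step(a)   # assumed interface

        # Record the real transition for TTD and for the planning model.
        window.append((state, a, reward, next_state, next_actions, done))
        model[(state, a)] = (reward, next_state, next_actions, done)

        # TTD update of the oldest state-action pair once the window is full.
        if len(window) == M:
            s0, a0 = window[0][0], window[0][1]
            Q[(s0, a0)] += ALPHA * (ttd_return(window) - Q[(s0, a0)])
            window.pop(0)

        # Dyna-style planning: hypothetical one-step backups from the model.
        for _ in range(PLANNING_STEPS):
            (ps, pa), (pr, ps2, pacts, pdone) = random.choice(list(model.items()))
            backup = pr if pdone else pr + GAMMA * greedy_value(ps2, pacts)
            Q[(ps, pa)] += ALPHA * (backup - Q[(ps, pa)])

        if done:
            # Flush the remaining, shorter windows at the end of the episode.
            while window:
                s0, a0 = window[0][0], window[0][1]
                Q[(s0, a0)] += ALPHA * (ttd_return(window) - Q[(s0, a0)])
                window.pop(0)
            break
        state, actions = next_state, next_actions

Using one-step backups for the planned updates keeps the sketch short and corresponds to planning with λ = 0; the point of the combinations studied in the paper is that the hypothetical experiences, like the real ones, can also be credited over multiple steps.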
