Reinforcement Learning with a Hierarchy of Abstract Models

Reinforcement learning (RL) algorithms have traditionally been thought of as trial-and-error learning methods that use actual control experience to incrementally improve a control policy. Sutton's DYNA architecture demonstrated that RL algorithms can work just as well on simulated experience drawn from an environment model, and that the resulting computation is closely related to one-step lookahead planning. Inspired by the literature on hierarchical planning, I propose learning a hierarchy of environment models that abstract temporal detail as a means of improving the scalability of RL algorithms. I present H-DYNA (Hierarchical DYNA), an extension of Sutton's DYNA architecture that learns such a hierarchy of abstract models. H-DYNA differs from hierarchical planners in two ways: first, the abstract models are learned from experience gained while learning to solve other tasks in the same environment, and second, the abstract models can be used to solve stochastic control tasks. Simulations on a set of compositionally structured navigation tasks show that H-DYNA learns to solve them faster than conventional RL algorithms. The abstract models also serve as mechanisms for transferring learning across multiple tasks.
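As a point of reference, the sketch below illustrates the underlying DYNA idea that H-DYNA extends: each real transition updates both the value function and a learned environment model, and additional simulated transitions drawn from that model provide extra planning updates, each equivalent to a one-step lookahead backup. This is a minimal tabular Dyna-Q-style sketch, not the paper's H-DYNA algorithm; the `env` interface (`reset()`, `step(a)`, `actions`) and all hyperparameter names are illustrative assumptions.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=200, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q sketch: real experience updates the value
    function and a learned model; simulated transitions replayed
    from the model provide extra planning backups."""
    Q = defaultdict(float)   # Q[(state, action)] -> estimated return
    model = {}               # model[(state, action)] -> (reward, next_state, done)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # learn from the real transition (one-step Q-learning backup)
            best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # record the transition in the (deterministic) model
            model[(s, a)] = (r, s2, done)
            # planning: replay simulated experience sampled from the model
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                p_best = 0.0 if pdone else max(Q[(ps2, a_)] for a_ in env.actions)
                Q[(ps, pa)] += alpha * (pr + gamma * p_best - Q[(ps, pa)])
            s = s2
    return Q
```

In H-DYNA the same mechanism would additionally maintain abstract models whose transitions summarize temporally extended behavior, so that a single simulated backup can span many primitive steps; that extension is beyond this one-step sketch.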
