Using Options for Knowledge Transfer in Reinforcement Learning

One of the original motivations for the use of temporally extended actions, or options, in reinforcement learning was to enable the transfer of learned value functions or policies to new problems. Many experimenters have used options to speed learning on single problems, but options have not been studied in depth as a tool for transfer. In this paper we introduce a formal model of a learning problem as a distribution of Markov Decision Problems (MDPs). Each MDP represents a task the agent will have to solve. Our model can also be viewed as a partially observable Markov decision problem (POMDP) with a special structure that we describe. We study two learning algorithms: one that keeps a single value function that generalizes across tasks, and an incremental POMDP-inspired method that maintains separate value functions for each task. We evaluate the learning algorithms on an extension of the Mountain Car domain, in terms of both learning speed and asymptotic performance. Empirically, we find that temporally extended options can facilitate transfer for both algorithms. In our domain, the single value function algorithm has much better learning speed because it generalizes its experience more broadly across tasks. We also observe that different sets of options achieve different tradeoffs between learning speed and asymptotic performance.
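The "single value function" variant described in the abstract can be illustrated with standard SMDP Q-learning over options, where one action-value table is reused for every task drawn from the distribution. The sketch below is not the paper's algorithm or domain: it assumes a toy family of chain MDPs whose goal location varies per task (standing in for the Mountain Car extension), a hand-coded "run right" option, and arbitrary hyperparameters, all chosen only for illustration.

```python
"""Minimal sketch: SMDP Q-learning with one shared value table across a
distribution of tasks. Task family, option set, and hyperparameters are
assumptions for this example, not taken from the paper."""
import random

N = 10        # chain length (assumed)
GAMMA = 0.95  # discount factor
ALPHA = 0.1   # learning rate
EPSILON = 0.1 # exploration rate

def sample_task():
    """Draw a task from the distribution: a chain MDP whose goal differs."""
    return random.randrange(1, N)

def step(state, action, goal):
    """Primitive dynamics: action 0 = left, 1 = right; -1 per step, 0 at goal."""
    nxt = max(0, min(N - 1, state + (1 if action == 1 else -1)))
    done = (nxt == goal)
    return nxt, (0.0 if done else -1.0), done

def run_option(opt, state, goal):
    """Execute an option; options 0/1 are primitives, option 2 runs right
    until the goal or the wall. Returns (next_state, discounted_return, k, done)."""
    total, disc, k, done = 0.0, 1.0, 0, False
    while True:
        action = 1 if opt == 2 else opt
        state, r, done = step(state, action, goal)
        total += disc * r
        disc *= GAMMA
        k += 1
        if opt < 2 or done or state == N - 1:
            return state, total, k, done

Q = {}  # single value table shared by every task: (state, option) -> value

def q(s, o):
    return Q.get((s, o), 0.0)

def choose(s):
    """Epsilon-greedy choice among the three options."""
    if random.random() < EPSILON:
        return random.randrange(3)
    return max(range(3), key=lambda o: q(s, o))

for _ in range(2000):
    goal = sample_task()  # a new task is drawn for each episode
    s, done = 0, False
    while not done:
        o = choose(s)
        s2, r, k, done = run_option(o, s, goal)
        target = r if done else r + (GAMMA ** k) * max(q(s2, o2) for o2 in range(3))
        Q[(s, o)] = q(s, o) + ALPHA * (target - q(s, o))  # SMDP Q-learning update
        s = s2
```

Because the table is keyed only on the observable state, experience from every sampled task updates the same entries, which is the cross-task generalization the abstract attributes to the single-value-function algorithm; the POMDP-inspired variant would instead maintain separate tables per task.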
