Between MDPs and Semi-MDPs: Learning, Planning, and Representing Knowledge at Multiple Temporal Scales

Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key challenges for AI. In this paper we develop an approach to these problems based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action to include \emph{options}: whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches or joint torques. Options may be given a priori, learned by experience, or both. They may be used interchangeably with actions in a variety of planning and learning methods. The theory of semi-Markov decision processes (SMDPs) can be applied to model the consequences of options and to plan and learn with them. In this paper we develop these connections, building on prior work by Bradtke and Duff (1995), Parr (1998), and others. Our main novel results concern the interface between the MDP and SMDP levels of analysis. We show how a set of options can be altered by changing only their termination conditions, improving over SMDP methods at no additional cost. We also introduce \emph{intra-option} temporal-difference methods that are able to learn from fragments of an option's execution. Finally, we propose a notion of subgoal that can be used to improve the options themselves. Overall, we argue that options and their models provide hitherto-missing aspects of a powerful, clear, and expressive framework for representing and organizing knowledge.
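
To make the central construct concrete, here is a minimal formal sketch. The triple notation $\langle \mathcal{I}, \pi, \beta \rangle$ is the paper's standard formulation of an option; the update rule below is the usual SMDP-style Q-learning rule, given here as an illustration rather than as a summary of the paper's novel results:

\[
  o = \langle \mathcal{I}, \pi, \beta \rangle, \qquad
  \mathcal{I} \subseteq \mathcal{S}, \quad
  \pi : \mathcal{S} \times \mathcal{A} \to [0,1], \quad
  \beta : \mathcal{S} \to [0,1],
\]

where $\mathcal{I}$ is the set of states in which option $o$ may be initiated, $\pi$ is the policy followed while $o$ executes, and $\beta(s)$ is the probability that $o$ terminates in state $s$. A fixed set of Markov options over an MDP induces an SMDP, so SMDP Q-learning applies directly: after executing $o$ from state $s$ for $k$ steps, observing cumulative discounted reward $r$ and arriving in state $s'$,

\[
  Q(s,o) \leftarrow Q(s,o) + \alpha \Bigl[ r + \gamma^{k} \max_{o' \in \mathcal{O}_{s'}} Q(s',o') - Q(s,o) \Bigr].
\]

Intra-option methods, by contrast, update on every single-step transition all Markov options whose policies are consistent with the action taken, not only the option actually being executed.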

[1] John R. Koza, et al. Automatic Programming of Robots Using Genetic Programming. AAAI, 1992.

[2] Leslie Pack Kaelbling, et al. Planning under Time Constraints in Stochastic Domains. Artif. Intell., 1993.

[3] R. Korf. Learning to Solve Problems by Searching for Macro-Operators, 1983.

[4] Satinder Singh. Transfer of Learning by Composing Solutions of Elemental Sequential Tasks. Mach. Learn., 1992.

[5] Mark B. Ring. Incremental Development of Complex Behaviors. ML, 1991.

[6] Reid G. Simmons, et al. Probabilistic Robot Navigation in Partially Observable Environments. IJCAI, 1995.

[7] Roderic A. Grupen, et al. A Feedback Control Structure for On-line Learning Tasks. Robotics Auton. Syst., 1997.

[8] Pattie Maes, et al. A Bottom-Up Mechanism for Behavior Selection in an Artificial Creature, 1991.

[9] Marco Colombetti, et al. Robot Shaping: Developing Autonomous Agents Through Learning. Artif. Intell., 1994.

[10] Richard S. Sutton, et al. TD Models: Modeling the World at a Mixture of Time Scales. ICML, 1995.

[11] John N. Tsitsiklis, et al. Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks. NIPS, 1997.

[12] Maja J. Mataric, et al. Behaviour-Based Control: Examples from Navigation, Learning, and Group Behaviour. J. Exp. Theor. Artif. Intell., 1997.

[13] Selahattin Kuru, et al. Qualitative System Identification: Deriving Structure from Behavior. Artif. Intell., 1996.

[14] Thomas G. Dietterich. The MAXQ Method for Hierarchical Reinforcement Learning. ICML, 1998.

[15] Stuart J. Russell, et al. Reinforcement Learning with Hierarchies of Machines. NIPS, 1997.

[16] Benjamin Kuipers, et al. Common-Sense Knowledge of Space: Learning from Experience. IJCAI, 1979.

[17] Paul R. Cohen, et al. Concepts from Time Series. AAAI/IAAI, 1998.

[18] Allen Newell, et al. Human Problem Solving, 1973.

[19] Chris Drummond, et al. Composing Functions to Speed up Reinforcement Learning in a Changing World. ECML, 1998.

[20] Andrew G. Barto, et al. Improving Elevator Performance Using Reinforcement Learning. NIPS, 1995.

[21] Lambert E. Wixson, et al. Scaling Reinforcement Learning Techniques via Modularity. ML, 1991.

[22] Andrew G. Barto, et al. Learning to Act Using Real-Time Dynamic Programming. Artif. Intell., 1995.

[23] Marco Colombetti, et al. Behavior Analysis and Training: A Methodology for Behavior Engineering. IEEE Trans. Syst. Man Cybern., Part B, 1996.

[24] Satinder P. Singh, et al. Reinforcement Learning with a Hierarchy of Abstract Models. AAAI, 1992.

[25] Michael I. Jordan, et al. Technical report, MIT Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, 1996.

[26] Doina Precup, et al. Multi-time Models for Temporally Abstract Planning. NIPS, 1997.

[27] Michael O. Duff, et al. Reinforcement Learning Methods for Continuous-Time Markov Decision Problems. NIPS, 1994.

[28] Johan de Kleer, et al. A Qualitative Physics Based on Confluences. Artif. Intell., 1984.

[29] Rodney A. Brooks, et al. A Robust Layered Control System for a Mobile Robot. IEEE Journal of Robotics and Automation, 1986.

[30] Roger W. Brockett, et al. Hybrid Models for Motion Control Systems, 1993.

[31] Sridhar Mahadevan, et al. Automatic Programming of Behavior-Based Robots Using Reinforcement Learning. Artif. Intell., 1991.

[32] Robert L. Grossman, et al. Timed Automata. CAV, 1999.

[33] Thomas Dean, et al. Decomposition Techniques for Planning in Stochastic Domains. IJCAI, 1995.

[34] Russell Greiner, et al. A Statistical Approach to Solving the EBL Utility Problem. AAAI, 1992.

[35] Doina Precup, et al. Theoretical Results on Reinforcement Learning with Temporally Abstract Options. ECML, 1998.

[36] R. Sutton, et al. Macro-Actions in Reinforcement Learning: An Empirical Analysis, 1998.

[37] Peter Dayan, et al. Improving Generalization for Temporal Difference Learning: The Successor Representation. Neural Computation, 1993.

[38] Gerald DeJong, et al. Learning to Plan in Continuous Domains. Artif. Intell., 1994.

[39] Ronen I. Brafman, et al. Prioritized Goal Decomposition of Markov Decision Processes: Toward a Synthesis of Classical and Decision Theoretic Planning. IJCAI, 1997.

[40] Leslie Pack Kaelbling, et al. Hierarchical Learning in Stochastic Domains: Preliminary Results. ICML, 1993.

[41] Richard Fikes, et al. Learning and Executing Generalized Robot Plans. Artif. Intell., 1993.

[42] Richard S. Sutton, et al. Roles of Macro-Actions in Accelerating Reinforcement Learning, 1998.

[43] Jürgen Schmidhuber, et al. HQ-Learning. Adapt. Behav., 1997.

[44] Nils J. Nilsson, et al. Teleo-Reactive Programs for Agent Control. J. Artif. Intell. Res., 1993.

[45] Gary L. Drescher, et al. Made-Up Minds: A Constructivist Approach to Artificial Intelligence, 1991.

[46] Eric A. Hansen, et al. Cost-Effective Sensing during Plan Execution. AAAI, 1994.

[47] Earl D. Sacerdoti, et al. Planning in a Hierarchy of Abstraction Spaces. IJCAI, 1974.

[48] Milos Hauskrecht, et al. Hierarchical Solution of Markov Decision Processes Using Macro-actions. UAI, 1998.

[49] Geoffrey E. Hinton, et al. Feudal Reinforcement Learning. NIPS, 1992.

[50] S. Haykin, et al. A Q-Learning-Based Dynamic Channel Assignment Technique for Mobile Communication Systems, 1999.

[51] Minoru Asada, et al. Behavior Coordination for a Mobile Robot Using Modular Reinforcement Learning. IROS, 1996.

[52] Long-Ji Lin, et al. Reinforcement Learning for Robots Using Neural Networks, 1992.

[53] Sebastian Thrun, et al. Finding Structure in Reinforcement Learning. NIPS, 1994.

[54] Dimitri P. Bertsekas, et al. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems. NIPS, 1996.

[55] Satinder P. Singh, et al. The Efficient Learning of Multiple Task Sequences. NIPS, 1991.

[56] Nils J. Nilsson, et al. A Hierarchical Robot Planning and Execution System, 1973.

[57] L. Chrisman. Reasoning About Probabilistic Actions at Multiple Levels of Granularity, 1994.

[58] David Ruby, et al. Learning Episodes for Optimization. ML, 1992.

[59] Ronen I. Brafman, et al. Modeling Agents as Qualitative Decision Makers. Artif. Intell., 1997.

[60] Thomas G. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. J. Artif. Intell. Res., 1999.

[61] Jonas Karlsson, et al. Learning via Task Decomposition, 1993.

[62] Rodney A. Brooks, et al. Learning to Coordinate Behaviors. AAAI, 1990.

[63] Karen Zita Haigh, et al. Exploiting Domain Geometry in Analogical Route Planning. J. Exp. Theor. Artif. Intell., 1997.

[64] Roderic A. Grupen, et al. Robust Reinforcement Learning in Motion Planning. NIPS, 1993.

[65] Richard E. Korf, et al. Planning as Search: A Quantitative Approach. Artif. Intell., 1987.

[66] Gerald Tesauro, et al. Temporal Difference Learning and TD-Gammon. Commun. ACM, 1995.