Between MDPs and Semi-MDPs: Learning, Planning, and Representing Knowledge at Multiple Temporal Scales

Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key challenges for AI. In this paper we develop an approach to these problems based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action to include options—whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Options may be given a priori, learned by experience, or both. They may be used interchangeably with actions in a variety of planning and learning methods. The theory of semi-Markov decision processes (SMDPs) can be applied to model the consequences of options and as a basis for planning and learning methods using them. In this paper we develop these connections, building on prior work by Bradtke and Duff (1995), Parr (in prep.), and others. Our main novel results concern the interface between the MDP and SMDP levels of analysis. We show how a set of options can be altered by changing only their termination conditions to improve over SMDP methods with no additional cost. We also introduce intra-option temporal-difference methods that are able to learn from fragments of an option's execution. Finally, we propose a notion of subgoal which can be used to improve the options themselves. Overall, we argue that options and their models provide hitherto missing aspects of a powerful, clear, and expressive framework for representing and organizing knowledge.

1. Temporal Abstraction

To make everyday decisions, people must foresee the consequences of their possible courses of action at multiple levels of temporal abstraction. Consider a traveler deciding to undertake a journey to a distant city. To decide whether or not to go, the benefits of the trip must be weighed against the expense. Having decided to go, choices must be made at each leg, e.g., whether to fly or to drive, whether to take a taxi or to arrange a ride. Each of these steps involves foresight and decision, all the way down to the smallest of actions. For example, just to call a taxi may involve finding a telephone, dialing each digit, and the individual muscle contractions to lift the receiver to the ear. Human decision making routinely involves planning and foresight—choice among temporally-extended options—over a broad range of time scales.

In this paper we examine the nature of the knowledge needed to plan and learn at multiple levels of temporal abstraction. The principal knowledge needed is the ability to predict the consequences of different courses of action. This may seem straightforward, but it is not. It is not at all clear what we mean either by a “course of action” or, particularly, by “its consequences”. One problem is that most courses of action have many consequences, with the immediate consequences different from the longer-term ones. For example, the course of action go-to-the-library may have the near-term consequence of being outdoors and walking, and the long-term consequence of being indoors and reading. In addition, we usually only consider courses of action for a limited but indefinite time period. An action like wash-the-car is most usefully executed up until the car is clean, but without specifying a particular time at which it is to stop.

We seek a way of representing predictive knowledge that is:

Expressive: The representation must be able to include basic kinds of commonsense knowledge such as the examples we have mentioned. In particular, it should be able to predict consequences that are temporally extended and uncertain. This criterion rules out many conventional engineering representations, such as differential equations and transition probabilities. The representation should also be able to predict the consequences of courses of action that are stochastic and contingent on subsequent observations. This rules out simple sequences of action with a deterministically known outcome, such as conventional macro-operators.

Clear: The representation should be clear, explicit, and grounded in primitive observations and actions. Ideally it would be expressed in a formal mathematical language. Any predictions made should be testable simply by comparing them against data: no human interpretation should be necessary. This criterion rules out conventional AI representations with ungrounded symbols. For example, “Tweety is a bird” relies on people to understand “Tweety,” “Bird,” and “is-a”; none of these has a clear interpretation in terms of observables. A related criterion is that the representation should be learnable. Only a representation that is clear and directly testable from observables is likely to be learnable. A clear representation need not be unambiguous. For example, it could predict that one of two events will occur at a particular time, but not specify which of them will occur.

Suitable for Planning: A representation of knowledge must be suitable for how it will be used as part of planning and decision-making. In particular, the representation should enable interrelating and intermixing knowledge at different levels of temporal abstraction.

It should be clear that we are addressing a fundamental question of AI: how should an intelligent agent represent its knowledge of the world? We are interested here in the underlying semantics of the knowledge, not in its surface form. In particular, we are not concerned with the data structures of the knowledge representation, e.g., whether the
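To make the option construct summarized in the abstract concrete, the following Python fragment is a minimal illustrative sketch of ours, not code from the paper: it represents an option by an initiation set, a (here deterministic) policy, and a termination condition, and shows the kind of SMDP-style Q-learning backup the abstract alludes to, in which an option that ran for k steps is treated as a single temporally extended action. The names Option and smdp_q_update, the tabular dictionary Q, and the particular field layout are assumptions chosen for illustration.

    from dataclasses import dataclass
    from typing import Callable, Dict, Hashable, List, Set, Tuple

    State = Hashable
    Action = Hashable

    @dataclass
    class Option:
        # An option in the sense of the abstract: a temporally extended
        # course of action.  Field names are illustrative, not the paper's.
        initiation_set: Set[State]             # states in which the option may be invoked
        policy: Callable[[State], Action]      # action chosen while the option is executing
        termination: Callable[[State], float]  # probability of terminating in each state

    def smdp_q_update(Q: Dict[Tuple[State, int], float],
                      options: List[Option],
                      s: State, o: int, reward: float, k: int, s_next: State,
                      alpha: float = 0.1, gamma: float = 0.95) -> None:
        # One SMDP-style Q-learning backup: option index `o` was started in `s`,
        # ran for `k` primitive steps, accumulated discounted reward `reward`,
        # and terminated in `s_next`.  The option is treated as a single action
        # whose delayed outcome is discounted by gamma**k.
        best_next = max((Q.get((s_next, i), 0.0)
                         for i, opt in enumerate(options)
                         if s_next in opt.initiation_set),
                        default=0.0)
        q_old = Q.get((s, o), 0.0)
        Q[(s, o)] = q_old + alpha * (reward + gamma ** k * best_next - q_old)

A stochastic option policy, as allowed in the abstract, would replace the deterministic policy field; the backup itself is unchanged, since only the option's overall duration and accumulated reward enter the update.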

References

[1] Richard Fikes, et al. Learning and Executing Generalized Robot Plans, 1993, Artif. Intell.

[2] Earl D. Sacerdoti, et al. Planning in a Hierarchy of Abstraction Spaces, 1974, IJCAI.

[3] Allen Newell, et al. Human Problem Solving, 1973.

[4] Nils J. Nilsson, et al. A Hierarchical Robot Planning and Execution System, 1973.

[5] Benjamin Kuipers, et al. Common-Sense Knowledge of Space: Learning from Experience, 1979, IJCAI.

[6] R. Korf. Learning to solve problems by searching for macro-operators, 1983.

[7] Johan de Kleer, et al. A Qualitative Physics Based on Confluences, 1984, Artif. Intell.

[8] Rodney A. Brooks, et al. A Robust Layered Control System for a Mobile Robot, 1986.

[9] Richard E. Korf, et al. Planning as Search: A Quantitative Approach, 1987, Artif. Intell.

[10] Rodney A. Brooks, et al. Learning to Coordinate Behaviors, 1990, AAAI.

[11] Lambert E. Wixson, et al. Scaling Reinforcement Learning Techniques via Modularity, 1991, ML.

[12] Pattie Maes, et al. A bottom-up mechanism for behavior selection in an artificial creature, 1991.

[13] Gary L. Drescher, et al. Made-up minds: a constructivist approach to artificial intelligence, 1991.

[14] Jürgen Schmidhuber, et al. Neural sequence chunkers, 1991, Forschungsberichte, TU Munich.

[15] Satinder P. Singh, et al. The Efficient Learning of Multiple Task Sequences, 1991, NIPS.

[16] Geoffrey E. Hinton, et al. Feudal Reinforcement Learning, 1992, NIPS.

[17] John R. Koza, et al. Automatic Programming of Robots Using Genetic Programming, 1992, AAAI.

[18] David Ruby, et al. Learning Episodes for Optimization, 1992, ML.

[19] Long-Ji Lin, et al. Reinforcement learning for robots using neural networks, 1992.

[20] Russell Greiner, et al. A Statistical Approach to Solving the EBL Utility Problem, 1992, AAAI.

[21] Sridhar Mahadevan, et al. Automatic Programming of Behavior-Based Robots Using Reinforcement Learning, 1991, Artif. Intell.

[22] Satinder P. Singh, et al. Reinforcement Learning with a Hierarchy of Abstract Models, 1992, AAAI.

[23] Roger W. Brockett, et al. Hybrid Models for Motion Control Systems, 1993.

[24] Roderic A. Grupen, et al. Robust Reinforcement Learning in Motion Planning, 1993, NIPS.

[25] Robert L. Grossman, et al. Timed Automata, 1999, CAV.

[26] Andrew W. Moore, et al. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces, 1995, Machine Learning.

[27] Jonas Karlsson, et al. Learning via task decomposition, 1993.

[28] Peter Dayan, et al. Improving Generalization for Temporal Difference Learning: The Successor Representation, 1993, Neural Computation.

[29] Leslie Pack Kaelbling, et al. Hierarchical Learning in Stochastic Domains: Preliminary Results, 1993, ICML.

[30] Michael I. Jordan, et al. Massachusetts Institute of Technology, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, 1996.

[31] L. Chrisman. Reasoning About Probabilistic Actions At Multiple Levels of Granularity, 1994.

[32] Marco Colombetti, et al. Robot Shaping: Developing Autonomous Agents Through Learning, 1994, Artif. Intell.

[33] Michael O. Duff, et al. Reinforcement Learning Methods for Continuous-Time Markov Decision Problems, 1994, NIPS.

[34] Nils J. Nilsson, et al. Teleo-Reactive Programs for Agent Control, 1993, J. Artif. Intell. Res.

[35] Eric A. Hansen, et al. Cost-Effective Sensing during Plan Execution, 1994, AAAI.

[36] Sebastian Thrun, et al. Finding Structure in Reinforcement Learning, 1994, NIPS.

[37] Gerald DeJong, et al. Learning to Plan in Continuous Domains, 1994, Artif. Intell.

[38] Gerald Tesauro, et al. Temporal Difference Learning and TD-Gammon, 1995, J. Int. Comput. Games Assoc.

[39] Leslie Pack Kaelbling, et al. Planning under Time Constraints in Stochastic Domains, 1993, Artif. Intell.

[40] Richard S. Sutton, et al. TD Models: Modeling the World at a Mixture of Time Scales, 1995, ICML.

[41] Thomas Dean, et al. Decomposition Techniques for Planning in Stochastic Domains, 1995, IJCAI.

[42] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[43] Reid G. Simmons, et al. Probabilistic Robot Navigation in Partially Observable Environments, 1995, IJCAI.

[44] Andrew G. Barto, et al. Improving Elevator Performance Using Reinforcement Learning, 1995, NIPS.

[45] David B. Leake, et al. Quantitative Results Concerning the Utility of Explanation-Based Learning, 1995.

[46] Andrew G. Barto, et al. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.

[47] Selahattin Kuru, et al. Qualitative System Identification: Deriving Structure from Behavior, 1996, Artif. Intell.

[48] Roderic A. Grupen, et al. Learning Control Composition in a Complex Environment, 1996.

[49] Minoru Asada, et al. Behavior coordination for a mobile robot using modular reinforcement learning, 1996, Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '96).

[50] Dimitri P. Bertsekas, et al. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems, 1996, NIPS.

[51] Marco Colombetti, et al. Behavior analysis and training: a methodology for behavior engineering, 1996, IEEE Trans. Syst. Man Cybern. Part B.

[52] John N. Tsitsiklis, et al. Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks, 1997, NIPS.

[53] Ronen I. Brafman, et al. Prioritized Goal Decomposition of Markov Decision Processes: Toward a Synthesis of Classical and Decision Theoretic Planning, 1997, IJCAI.

[54] Stuart J. Russell, et al. Reinforcement Learning with Hierarchies of Machines, 1997, NIPS.

[55] Jürgen Schmidhuber, et al. HQ-Learning, 1997, Adapt. Behav.

[56] Richard S. Sutton, et al. Roles of Macro-Actions in Accelerating Reinforcement Learning, 1998.

[57] Doina Precup, et al. Multi-time Models for Temporally Abstract Planning, 1997, NIPS.

[58] Maja J. Matarić, et al. Behavior-based Control: Examples from Navigation, Learning, and Group Behavior, 1997.

[59] Ronen I. Brafman, et al. Modeling Agents as Qualitative Decision Makers, 1997, Artif. Intell.

[60] Csaba Szepesvári, et al. Module Based Reinforcement Learning for a Real Robot, 1997.

[61] Roderic A. Grupen, et al. A feedback control structure for on-line learning tasks, 1997, Robotics Auton. Syst.

[62] Milos Hauskrecht, et al. Hierarchical Solution of Markov Decision Processes using Macro-actions, 1998, UAI.

[63] Doina Precup, et al. Theoretical Results on Reinforcement Learning with Temporally Abstract Options, 1998, ECML.

[64] Ronald E. Parr, et al. Hierarchical control and learning for Markov decision processes, 1998.

[65] R. Sutton, et al. Macro-Actions in Reinforcement Learning: An Empirical Analysis, 1998.

[66] Kee-Eung Kim, et al. Solving Very Large Weakly Coupled Markov Decision Processes, 1998, AAAI/IAAI.

[67] Blai Bonet. High-Level Planning and Control with Incomplete Information Using POMDP's, 1998.

[68] Thomas G. Dietterich. The MAXQ Method for Hierarchical Reinforcement Learning, 1998, ICML.

[69] Chris Drummond, et al. Composing Functions to Speed up Reinforcement Learning in a Changing World, 1998, ECML.

[70] Paul R. Cohen, et al. Concepts From Time Series, 1998, AAAI/IAAI.

[71] S. Haykin, et al. A Q-learning-based dynamic channel assignment technique for mobile communication systems, 1999.

[72] Thomas G. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, 1999, J. Artif. Intell. Res.