Between MDPs and Semi-MDPs: Learning, Planning, and Representing Knowledge at Multiple Temporal Scales

Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key challenges for AI. In this paper we develop an approach to these problems based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action to include options—whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Options may be given a priori, learned by experience, or both. They may be used interchangeably with actions in a variety of planning and learning methods. The theory of semi-Markov decision processes (SMDPs) can be applied to model the consequences of options and as a basis for planning and learning methods using them. In this paper we develop these connections, building on prior work by Bradtke and Duff (1995), Parr (in prep.), and others. Our main novel results concern the interface between the MDP and SMDP levels of analysis. We show how a set of options can be altered by changing only their termination conditions to improve over SMDP methods with no additional cost. We also introduce intra-option temporal-difference methods that are able to learn from fragments of an option's execution. Finally, we propose a notion of subgoal which can be used to improve the options themselves. Overall, we argue that options and their models provide hitherto missing aspects of a powerful, clear, and expressive framework for representing and organizing knowledge.

1. Temporal Abstraction

To make everyday decisions, people must foresee the consequences of their possible courses of action at multiple levels of temporal abstraction. Consider a traveler deciding to undertake a journey to a distant city. To decide whether or not to go, the benefits of the trip must be weighed against the expense. Having decided to go, choices must be made at each leg, e.g., whether to fly or to drive, whether to take a taxi or to arrange a ride. Each of these steps involves foresight and decision, all the way down to the smallest of actions. For example, just to call a taxi may involve finding a telephone, dialing each digit, and the individual muscle contractions to lift the receiver to the ear. Human decision making routinely involves planning and foresight—choice among temporally extended options—over a broad range of time scales.

In this paper we examine the nature of the knowledge needed to plan and learn at multiple levels of temporal abstraction. The principal knowledge needed is the ability to predict the consequences of different courses of action. This may seem straightforward, but it is not. It is not at all clear what we mean either by a "course of action" or, particularly, by "its consequences". One problem is that most courses of action have many consequences, with the immediate consequences different from the longer-term ones. For example, the course of action go-to-the-library may have the near-term consequence of being outdoors and walking, and the long-term consequence of being indoors and reading. In addition, we usually only consider courses of action for a limited but indefinite time period.
An action like wash-the-car is most usefully executed up until the car is clean, but without specifying a particular time at which it is to stop. We seek a way of representing predictive knowledge that is:

Expressive: The representation must be able to include basic kinds of commonsense knowledge such as the examples we have mentioned. In particular, it should be able to predict consequences that are temporally extended and uncertain. This criterion rules out many conventional engineering representations, such as differential equations and transition probabilities. The representation should also be able to predict the consequences of courses of action that are stochastic and contingent on subsequent observations. This rules out simple sequences of actions with a deterministically known outcome, such as conventional macro-operators.

Clear: The representation should be clear, explicit, and grounded in primitive observations and actions. Ideally it would be expressed in a formal mathematical language. Any predictions made should be testable simply by comparing them against data: no human interpretation should be necessary. This criterion rules out conventional AI representations with ungrounded symbols. For example, "Tweety is a bird" relies on people to understand "Tweety," "bird," and "is-a"; none of these has a clear interpretation in terms of observables. A related criterion is that the representation should be learnable. Only a representation that is clear and directly testable from observables is likely to be learnable. A clear representation need not be unambiguous. For example, it could predict that one of two events will occur at a particular time, but not specify which of them will occur.

Suitable for Planning: A representation of knowledge must be suitable for how it will be used as part of planning and decision-making. In particular, the representation should enable interrelating and intermixing knowledge at different levels of temporal abstraction.

It should be clear that we are addressing a fundamental question of AI: how should an intelligent agent represent its knowledge of the world? We are interested here in the underlying semantics of the knowledge, not in its surface form. In particular, we are not concerned with the data structures of the knowledge representation, e.g., whether the
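To make the notion of an option slightly more concrete, the following sketch (in Python; illustrative only, not code from this paper, and the environment interface env.step and all identifiers are hypothetical) represents an option in one common way, as an initiation set, an internal policy, and a termination condition, and shows an SMDP-style Q-learning backup in the spirit of Bradtke and Duff (1995), in which a completed option is treated as a single temporally extended action whose outcome is discounted by gamma**k over its k-step duration.

    # Illustrative sketch only (not from the paper). An option is modeled as a
    # triple: an initiation set I, an internal policy pi, and a termination
    # condition beta. The environment call env.step(state, action) is a
    # hypothetical stand-in returning (next_state, reward).
    import random
    from dataclasses import dataclass
    from typing import Callable, Dict, Hashable, List, Tuple

    State = Hashable
    Action = Hashable

    @dataclass
    class Option:
        name: str
        initiation: Callable[[State], bool]    # I: may the option start in this state?
        policy: Callable[[State], Action]      # pi: action chosen while the option runs
        termination: Callable[[State], float]  # beta: probability of stopping in a state

    def execute_option(env, state: State, option: Option,
                       gamma: float) -> Tuple[State, float, int]:
        """Run the option until beta terminates it; return the resulting state,
        the discounted reward accumulated along the way, and the elapsed steps."""
        total_reward, discount, steps = 0.0, 1.0, 0
        while True:
            action = option.policy(state)
            state, reward = env.step(state, action)
            total_reward += discount * reward
            discount *= gamma
            steps += 1
            if random.random() < option.termination(state):
                return state, total_reward, steps

    def smdp_q_update(Q: Dict[Tuple[State, str], float], state: State, option: Option,
                      next_state: State, reward: float, steps: int,
                      options: List[Option], alpha: float, gamma: float) -> None:
        """One SMDP Q-learning backup: the whole option is backed up as if it were
        a single action, with the discount gamma**steps reflecting its duration."""
        best_next = max((Q.get((next_state, o.name), 0.0)
                         for o in options if o.initiation(next_state)), default=0.0)
        key = (state, option.name)
        target = reward + (gamma ** steps) * best_next
        old = Q.get(key, 0.0)
        Q[key] = old + alpha * (target - old)

Under these assumptions, a primitive action fits the same interface as an option whose policy always selects that action and whose termination condition is identically one, which is what lets options and primitive actions be chosen interchangeably by the same learning or planning method.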

[1]  Richard Fikes,et al.  Learning and Executing Generalized Robot Plans , 1972, Artif. Intell..

[2]  Earl D. Sacerdoti,et al.  Planning in a Hierarchy of Abstraction Spaces , 1974, IJCAI.

[3]  Allen Newell,et al.  Human Problem Solving. , 1973 .

[4]  Nils J. Nilsson,et al.  A Hierarchical Robot Planning and Execution System. , 1973 .

[5]  Benjamin Kuipers,et al.  Common-Sense Knowledge of Space: Learning from Experience , 1979, IJCAI.

[6]  R. Korf Learning to solve problems by searching for macro-operators , 1983 .

[7]  Johan de Kleer,et al.  A Qualitative Physics Based on Confluences , 1984, Artif. Intell..

[8]  Rodney A. Brooks,et al.  A Robust Layered Control System For A Mobile Robot , 1986 .

[9]  Richard E. Korf,et al.  Planning as Search: A Quantitative Approach , 1987, Artif. Intell..

[10]  Rodney A. Brooks,et al.  Learning to Coordinate Behaviors , 1990, AAAI.

[11]  Lambert E. Wixson,et al.  Scaling Reinforcement Learning Techniques via Modularity , 1991, ML.

[12]  Pattie Maes,et al.  A bottom-up mechanism for behavior selection in an artificial creature , 1991 .

[13]  Gary L. Drescher,et al.  Made-up minds - a constructivist approach to artificial intelligence , 1991 .

[14]  Jürgen Schmidhuber  Neural Sequence Chunkers , 1991 .

[15]  Satinder P. Singh,et al.  The Efficient Learning of Multiple Task Sequences , 1991, NIPS.

[16]  Geoffrey E. Hinton,et al.  Feudal Reinforcement Learning , 1992, NIPS.

[17]  John R. Koza,et al.  Automatic Programming of Robots Using Genetic Programming , 1992, AAAI.

[18]  David Ruby,et al.  Learning Episodes for Optimization , 1992, ML.

[19]  Long-Ji Lin,et al.  Reinforcement learning for robots using neural networks , 1992 .

[20]  Russell Greiner,et al.  A Statistical Approach to Solving the EBL Utility Problem , 1992, AAAI.

[21]  Sridhar Mahadevan,et al.  Automatic Programming of Behavior-Based Robots Using Reinforcement Learning , 1991, Artif. Intell..

[22]  Satinder P. Singh,et al.  Reinforcement Learning with a Hierarchy of Abstract Models , 1992, AAAI.

[23]  Roger W. Brockett,et al.  Hybrid Models for Motion Control Systems , 1993 .

[24]  Roderic A. Grupen,et al.  Robust Reinforcement Learning in Motion Planning , 1993, NIPS.

[25]  Robert L. Grossman,et al.  Timed Automata , 1999, CAV.

[26]  Andrew W. Moore,et al.  The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces , 1995, Machine Learning.

[27]  Jonas Karlsson,et al.  Learning via task decomposition , 1993 .

[28]  Peter Dayan,et al.  Improving Generalization for Temporal Difference Learning: The Successor Representation , 1993, Neural Computation.

[29]  Leslie Pack Kaelbling,et al.  Hierarchical Learning in Stochastic Domains: Preliminary Results , 1993, ICML.

[30]  Michael I. Jordan,et al.  MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES , 1996 .

[31]  L. Chrisman Reasoning About Probabilistic Actions At Multiple Levels of Granularity , 1994 .

[32]  Marco Colombetti,et al.  Robot Shaping: Developing Autonomous Agents Through Learning , 1994, Artif. Intell..

[33]  Steven J. Bradtke,et al.  Reinforcement Learning Methods for Continuous-Time Markov Decision Problems , 1994, NIPS.

[34]  Nils J. Nilsson,et al.  Teleo-Reactive Programs for Agent Control , 1993, J. Artif. Intell. Res..

[35]  Eric A. Hansen,et al.  Cost-Effective Sensing during Plan Execution , 1994, AAAI.

[36]  Sebastian Thrun,et al.  Finding Structure in Reinforcement Learning , 1994, NIPS.

[37]  Gerald DeJong,et al.  Learning to Plan in Continuous Domains , 1994, Artif. Intell..

[38]  Gerald Tesauro,et al.  Temporal Difference Learning and TD-Gammon , 1995, J. Int. Comput. Games Assoc..

[39]  Leslie Pack Kaelbling,et al.  Planning under Time Constraints in Stochastic Domains , 1993, Artif. Intell..

[40]  Richard S. Sutton,et al.  TD Models: Modeling the World at a Mixture of Time Scales , 1995, ICML.

[41]  Thomas Dean,et al.  Decomposition Techniques for Planning in Stochastic Domains , 1995, IJCAI.

[42]  Christopher J. C. H. Watkins  Learning from Delayed Rewards , 1989, Ph.D. thesis, Cambridge University.

[43]  Reid G. Simmons,et al.  Probabilistic Robot Navigation in Partially Observable Environments , 1995, IJCAI.

[44]  Andrew G. Barto,et al.  Improving Elevator Performance Using Reinforcement Learning , 1995, NIPS.

[45]  David B. Leake,et al.  Quantitative Results Concerning the Utility of Explanation-Based Learning , 1995 .

[46]  Andrew G. Barto,et al.  Learning to Act Using Real-Time Dynamic Programming , 1995, Artif. Intell..

[47]  Selahattin Kuru,et al.  Qualitative System Identification: Deriving Structure from Behavior , 1996, Artif. Intell..

[48]  Roderic A. Grupen,et al.  Learning Control Composition in a Complex Environment , 1996 .

[49]  Minoru Asada,et al.  Behavior coordination for a mobile robot using modular reinforcement learning , 1996, Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems. IROS '96.

[50]  Dimitri P. Bertsekas,et al.  Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems , 1996, NIPS.

[51]  Marco Colombetti,et al.  Behavior analysis and training-a methodology for behavior engineering , 1996, IEEE Trans. Syst. Man Cybern. Part B.

[52]  John N. Tsitsiklis,et al.  Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks , 1997, NIPS.

[53]  Ronen I. Brafman,et al.  Prioritized Goal Decomposition of Markov Decision Processes: Toward a Synthesis of Classical and Decision Theoretic Planning , 1997, IJCAI.

[54]  Stuart J. Russell,et al.  Reinforcement Learning with Hierarchies of Machines , 1997, NIPS.

[55]  Jürgen Schmidhuber,et al.  HQ-Learning , 1997, Adapt. Behav..

[56]  Richard S. Sutton,et al.  Roles of Macro-Actions in Accelerating Reinforcement Learning , 1998 .

[57]  Doina Precup,et al.  Multi-time Models for Temporally Abstract Planning , 1997, NIPS.

[58]  Maja J. Matarić,et al.  Behavior-based Control: Examples from Navigation, Learning, and Group Behavior , 1997 .

[59]  Ronen I. Brafman,et al.  Modeling Agents as Qualitative Decision Makers , 1997, Artif. Intell..

[60]  Csaba Szepesvari,et al.  Module Based Reinforcement Learning for a Real Robot , 1997 .

[61]  Roderic A. Grupen,et al.  A feedback control structure for on-line learning tasks , 1997, Robotics Auton. Syst..

[62]  Milos Hauskrecht,et al.  Hierarchical Solution of Markov Decision Processes using Macro-actions , 1998, UAI.

[63]  Doina Precup,et al.  Theoretical Results on Reinforcement Learning with Temporally Abstract Options , 1998, ECML.

[64]  Ronald E. Parr,et al.  Hierarchical control and learning for Markov decision processes , 1998 .

[65]  R. Sutton,et al.  Macro-Actions in Reinforcement Learning: An Empirical Analysis , 1998 .

[66]  Kee-Eung Kim,et al.  Solving Very Large Weakly Coupled Markov Decision Processes , 1998, AAAI/IAAI.

[67]  Blai Bonet High-Level Planning and Control with Incomplete Information Using POMDP's , 1998 .

[68]  Thomas G. Dietterich The MAXQ Method for Hierarchical Reinforcement Learning , 1998, ICML.

[69]  Chris Drummond,et al.  Composing Functions to Speed up Reinforcement Learning in a Changing World , 1998, ECML.

[70]  Paul R. Cohen,et al.  Concepts From Time Series , 1998, AAAI/IAAI.

[71]  S. Haykin,et al.  A Q-learning-based dynamic channel assignment technique for mobile communication systems , 1999 .

[72]  Thomas G. Dietterich Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition , 1999, J. Artif. Intell. Res..