On the Significance of Markov Decision Processes

Formulating the problem facing an intelligent agent as a Markov decision process (MDP) is increasingly common in artificial intelligence, reinforcement learning, artificial life, and artificial neural networks. In this short paper we examine some of the reasons for the appeal of this framework. Foremost among these are its generality, simplicity, and emphasis on goal-directed interaction between the agent and its environment. MDPs may be becoming a common focal point for different approaches to understanding the mind. Finally, we speculate that this focus may be an enduring one insofar as many of the efforts to extend the MDP framework end up bringing a wider class of problems back within it.

Sometimes the establishment of a problem is a major step in the development of a field, more important than the discovery of solution methods. For example, the problem of supervised learning has played a central role as it has developed through pattern recognition, statistics, machine learning, and artificial neural networks. Regulation of linear systems practically defined the field of control theory for decades. To understand what has happened in these and other fields it is essential to track the origins, development, and range of acceptance of particular problem classes. Major points of change are sometimes marked by a new solution to an existing problem, but just as often by the promulgation and recognition of the significance of a new problem. Now may be one such time of transition in the study of mental processes, with Markov decision processes being the newly accepted problem.

Markov decision processes (MDPs) originated in the study of stochastic optimal control (Bellman, 1957) and have remained the key problem in that area ever since. In the 1980s and 1990s, incompletely known MDPs were gradually recognized as a natural problem formulation for reinforcement learning (e.g., …). Recognizing the common problem led to the discovery of a wealth of common algorithmic ideas and theoretical analyses. MDPs have also come to be widely studied within AI as a new, particularly suitable kind of planning problem, e.g., as decision-theoretic planning (Dean et al., 1995) and in conjunction with structured Bayes nets (Boutilier et al., 1995). In robotics, artificial life, and evolutionary methods it is less common to use the language and mathematics of MDPs, but again the problems considered are well expressed in MDP terms. Recognition of this
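Since the text refers to the MDP framework without spelling it out, the following is a minimal sketch, not taken from the paper: a finite MDP given by states, actions, transition probabilities, rewards, and a discount factor, solved by value iteration, i.e., repeated application of the Bellman optimality backup. The two-state transition table is hypothetical and purely illustrative.

```python
# Hypothetical two-state, two-action MDP (illustration only, not from the paper).
# P[(s, a)] = list of (next_state, probability, reward) triples.
P = {
    (0, 0): [(0, 1.0, 0.0)],
    (0, 1): [(1, 0.9, 1.0), (0, 0.1, 0.0)],
    (1, 0): [(1, 1.0, 0.5)],
    (1, 1): [(0, 1.0, 0.0)],
}
states, actions, gamma = [0, 1], [0, 1], 0.95

# Value iteration: sweep the Bellman optimality backup
#   V(s) <- max_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma * V(s'))
# until the value function stops changing.
V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        best = max(
            sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-8:
        break

# Greedy policy with respect to the converged values.
policy = {
    s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)]))
    for s in states
}
print(V, policy)
```

With the transition model P given in full, this is the planning setting; the "incompletely known MDP" setting of reinforcement learning, as described in the text, replaces these exact backups with estimates built from sampled interaction with the environment.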

[1] Benjamin Van Roy et al. A neuro-dynamic programming approach to retailer inventory management, 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[2] Doina Precup et al. Multi-time Models for Temporally Abstract Planning, 1997, NIPS.

[3] Ashwin Ram et al. Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces, 1997, Adapt. Behav.

[4] Peter Dayan et al. A Neural Substrate of Prediction and Reward, 1997, Science.

[5] Gerald Tesauro et al. On-line Policy Improvement using Monte-Carlo Search, 1996, NIPS.

[6] Andrew McCallum et al. Reinforcement learning with selective perception and hidden state, 1996.

[7] Andrew G. Barto et al. Improving Elevator Performance Using Reinforcement Learning, 1995, NIPS.

[8] Richard S. Sutton et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, 1995, NIPS.

[9] Ben J. A. Kröse et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[10] Craig Boutilier et al. Exploiting Structure in Policy Construction, 1995, IJCAI.

[11] Gerald Tesauro et al. Temporal difference learning and TD-Gammon, 1995, CACM.

[12] Richard S. Sutton et al. TD Models: Modeling the World at a Mixture of Time Scales, 1995, ICML.

[13] Leslie Pack Kaelbling et al. Planning under Time Constraints in Stochastic Domains, 1993, Artif. Intell.

[14] Andrew G. Barto et al. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.

[15] Joel L. Davis et al. A Model of How the Basal Ganglia Generate and Use Neural Signals That Predict Reinforcement, 1994.

[16] Michael I. Jordan et al. Reinforcement Learning with Soft State Aggregation, 1994, NIPS.

[17] Richard S. Sutton et al. Time-Derivative Models of Pavlovian Reinforcement, 1990.

[18] A. Barto et al. Learning and Sequential Decision Making, 1989.

[19] Ian H. Witten et al. Exploring, Modelling and Controlling Discrete Sequential Environments, 1977, Int. J. Man Mach. Stud.