On the Significance of Markov Decision Processes

Formulating the problem facing an intelligent agent as a Markov decision process (MDP) is increasingly common in artificial intelligence, reinforcement learning, artificial life, and artificial neural networks. In this short paper we examine some of the reasons for the appeal of this framework. Foremost among these are its generality, simplicity, and emphasis on goal-directed interaction between the agent and its environment. MDPs may be becoming a common focal point for different approaches to understanding the mind. Finally, we speculate that this focus may be an enduring one, insofar as many of the efforts to extend the MDP framework end up bringing a wider class of problems back within it.

Sometimes the establishment of a problem is a major step in the development of a field, more important than the discovery of solution methods. For example, the problem of supervised learning has played a central role as it has developed through pattern recognition, statistics, machine learning, and artificial neural networks. Regulation of linear systems has practically defined the field of control theory for decades. To understand what has happened in these and other fields, it is essential to track the origins, development, and range of acceptance of particular problem classes. Major points of change are marked sometimes by a new solution to an existing problem, but just as often by the promulgation and recognition of the significance of a new problem. Now may be one such time of transition in the study of mental processes, with Markov decision processes being the newly accepted problem.

Markov decision processes (MDPs) originated in the study of stochastic optimal control (Bellman, 1957) and have remained the key problem in that area ever since. In the 1970s and 1980s, incompletely known MDPs were gradually recognized as a natural problem formulation for reinforcement learning (e.g., Witten, 1977; Watkins, 1989; Sutton and Barto, 1998). Recognizing the common problem led to the discovery of a wealth of common algorithmic ideas and theoretical analyses. MDPs have also come to be widely studied within AI as a new, particularly suitable kind of planning problem, e.g., in decision-theoretic planning (Dean et al., 1993) and in conjunction with structured Bayes nets (Boutilier et al., 1995). In robotics, artificial life, and evolutionary methods it is less common to use the language and mathematics of MDPs, but again the problems considered are well expressed in MDP terms. Recognition of this common problem is likely to lead to greater understanding and cross-fertilization among these fields.

MDPs provide a simple, precise, general, and relatively neutral way of talking about a learning or planning agent interacting with its environment to achieve a goal. As such, MDPs are starting to provide a bridge to biological efforts to understand the mind. Analyses in MDP-like terms can be made in neuroscience (e.g., Schultz et al., 1997; Houk et al., 1994) and in psychology (e.g., Sutton and Barto, 1990; Barto et al., 1989). Of course, the modern drive for an interdisciplinary understanding of mind is larger than the interest in MDPs; the interest in MDPs is a product of the search for an interdisciplinary understanding. But MDPs are an important conceptual tool contributing to a common understanding of intelligence in animals and machines.
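For concreteness, the problem referred to throughout can be stated in the standard notation of the field (a minimal sketch in conventional notation, not reproduced from the original text): an MDP is a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $p(s' \mid s, a)$ gives state-transition probabilities, $r(s, a, s')$ gives expected rewards, and $0 \le \gamma < 1$ is a discount factor. The agent seeks a policy $\pi(a \mid s)$ maximizing the expected discounted return $\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1}\right]$, and the optimal state-value function satisfies the Bellman optimality equation

\[
v^{*}(s) \;=\; \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, v^{*}(s') \,\bigr].
\]

Both the learning methods and the planning methods cited above can be viewed as ways of approximately solving this equation when $p$ and $r$ are incompletely known or too large to enumerate.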

[1] R. Bellman. A Markovian Decision Process. 1957.

[2] Ian H. Witten. Exploring, Modelling, and Controlling Discrete Sequential Environments. 1977.

[3] C. Watkins. Learning from Delayed Rewards. 1989.

[4] Richard S. Sutton, et al. Learning and Sequential Decision Making. 1989.

[5] Richard S. Sutton, et al. Time-Derivative Models of Pavlovian Reinforcement. 1990.

[6] Satinder Singh. Transfer of Learning by Composing Solutions of Elemental Sequential Tasks. Mach. Learn., 1992.

[7] Joel L. Davis, et al. A Model of How the Basal Ganglia Generate and Use Neural Signals That Predict Reinforcement. 1994.

[8] Michael I. Jordan, et al. Learning Without State-Estimation in Partially Observable Markovian Decision Processes. ICML, 1994.

[9] Michael I. Jordan, et al. Reinforcement Learning with Soft State Aggregation. NIPS, 1994.

[10] Leslie Pack Kaelbling, et al. Planning under Time Constraints in Stochastic Domains. Artif. Intell., 1993.

[11] Richard S. Sutton, et al. TD Models: Modeling the World at a Mixture of Time Scales. ICML, 1995.

[12] Gerald Tesauro, et al. Temporal Difference Learning and TD-Gammon. CACM, 1995.

[13] Andrew G. Barto, et al. Improving Elevator Performance Using Reinforcement Learning. NIPS, 1995.

[14] Richard S. Sutton, et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. NIPS, 1995.

[15] Craig Boutilier, et al. Exploiting Structure in Policy Construction. IJCAI, 1995.

[16] Andrew G. Barto, et al. Learning to Act Using Real-Time Dynamic Programming. Artif. Intell., 1995.

[17] Gerald Tesauro, et al. On-line Policy Improvement using Monte-Carlo Search. NIPS, 1996.

[18] Andrew McCallum, et al. Reinforcement Learning with Selective Perception and Hidden State. 1996.

[19] Peter Dayan, et al. A Neural Substrate of Prediction and Reward. Science, 1997.

[20] Ashwin Ram, et al. Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces. Adapt. Behav., 1997.

[21] Doina Precup, et al. Multi-time Models for Temporally Abstract Planning. NIPS, 1997.

[22] Benjamin Van Roy, et al. A Neuro-Dynamic Programming Approach to Retailer Inventory Management. Proceedings of the 36th IEEE Conference on Decision and Control, 1997.

[23] Richard S. Sutton, et al. Introduction to Reinforcement Learning. 1998.