On the Significance of Markov Decision Processes

Formulating the problem facing an intelligent agent as a Markov decision process (MDP) is increasingly common in artificial intelligence, reinforcement learning, artificial life, and artificial neural networks. In this short paper we examine some of the reasons for the appeal of this framework. Foremost among these are its generality, simplicity, and emphasis on goal-directed interaction between the agent and its environment. MDPs may be becoming a common focal point for different approaches to understanding the mind. Finally, we speculate that this focus may be an enduring one insofar as many of the efforts to extend the MDP framework end up bringing a wider class of problems back within it.
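To make the framework concrete, the following is a minimal, hypothetical sketch (not from the paper) of the goal-directed interaction loop the abstract refers to: an agent repeatedly selects actions in an environment whose stochastic transitions and rewards define an MDP. The toy two-state transition table and the random placeholder policy are illustrative assumptions only.

```python
# Minimal illustrative MDP sketch (assumed example, not from the paper):
# states, actions, stochastic transitions, and rewards, plus the
# agent-environment interaction loop that the abstract emphasizes.
import random

# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def step(state, action):
    """Sample a next state and reward from the transition distribution."""
    r = random.random()
    cumulative = 0.0
    for prob, next_state, reward in transitions[state][action]:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return next_state, reward  # fallback for floating-point rounding

state, total_reward = 0, 0.0
for t in range(10):                      # goal-directed interaction loop
    action = random.choice([0, 1])       # placeholder policy; learning would go here
    state, reward = step(state, action)
    total_reward += reward
print("return over 10 steps:", total_reward)
```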
