Markov Decision Processes: Concepts and Algorithms

Situated between supervised learning and unsupervised learning, the paradigm of reinforcement learning deals with learning in sequential decision-making problems in which there is limited feedback. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming. First, the formal framework of Markov decision processes is defined, accompanied by the definition of value functions and policies. The main part of this text introduces foundational classes of algorithms for learning optimal behaviors, based on various definitions of optimality with respect to the goal of learning sequential decisions. Additionally, it surveys efficient extensions of these foundational algorithms, which differ mainly in how feedback from the environment is used to speed up learning and in how they concentrate computation on relevant parts of the problem. In both model-based and model-free settings, these extensions have proven useful for scaling up to larger problems.

∗ Compiled from draft material from "The Logic of Adaptive Behavior" by Martijn van Otterlo (van Otterlo, 2008).

Markov decision processes (MDPs) (Puterman, 1994) are an intuitive and fundamental formalism for decision-theoretic planning (DTP) (Boutilier et al., 1999; Boutilier, 1999), reinforcement learning (RL) (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Kaelbling et al., 1996) and other learning problems in stochastic domains. In this model, an environment is modelled as a set of states, and actions can be performed to control the system's state. The goal is to control the system in such a way that some performance criterion is maximized. Many problems, such as (stochastic) planning problems, learning robot control and game playing, have successfully been modelled in terms of an MDP. In fact, MDPs have become the de facto standard formalism for learning sequential decision making.

DTP (Boutilier et al., 1999), i.e. planning using decision-theoretic notions to represent uncertainty and plan quality, is an important extension of the AI planning paradigm, adding the ability to deal with uncertainty in action effects and with less well-defined goals. Furthermore, it adds a significant dimension in that it considers situations in which factors such as resource consumption and uncertainty demand solutions of varying quality, for example in real-time decision situations. There are many connections between AI planning, operations research (Winston, 1991) and control theory (Bertsekas, 1995), as most work on sequential decision making in these fields can be viewed as instances of MDPs. The notion of a plan in AI planning, i.e. a series of actions from a start state to a goal state, is extended to the notion of a policy, which is a mapping from every state to an (optimal) action, based on decision-theoretic measures of optimality with respect to some goal to be optimized.

As an example, consider a typical planning domain involving boxes to be moved around, where the goal is to move some particular boxes to a designated area. This type of problem can be solved using AI planning techniques. Consider now a slightly more realistic extension in which some of the actions can fail, or have uncertain side-effects that can depend on factors beyond the agent's control.
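To make the notions of state, action, transition model, policy and performance criterion concrete, the following is a minimal sketch (not taken from this text) of a tiny MDP solved by value iteration, the basic dynamic-programming scheme referred to above. The two state names, the transition probabilities, the rewards and the discount factor are all hypothetical, illustrative assumptions.

```python
# Minimal illustrative sketch (assumptions, not from the text): a tiny two-state MDP
# given as explicit tables, solved by value iteration (dynamic programming).

STATES = ["at_depot", "at_goal"]
ACTIONS = ["move", "stay"]

# Transition model P[s][a] -> {next_state: probability}; "move" may fail at the depot.
P = {
    "at_depot": {"move": {"at_goal": 0.8, "at_depot": 0.2},
                 "stay": {"at_depot": 1.0}},
    "at_goal":  {"move": {"at_goal": 1.0},
                 "stay": {"at_goal": 1.0}},
}
# Reward model R[s][a]: a cost of -1 per step until the goal is reached.
R = {
    "at_depot": {"move": -1.0, "stay": -1.0},
    "at_goal":  {"move": 0.0, "stay": 0.0},
}
GAMMA = 0.95  # discount factor


def value_iteration(theta=1e-8):
    """Return converged state values V and a greedy policy (state -> action)."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            # One-step lookahead: expected return of each action under current V.
            q = {a: R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())
                 for a in ACTIONS}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # A policy maps every state to an action; here we take the greedy one w.r.t. V.
    policy = {s: max(ACTIONS,
                     key=lambda a: R[s][a] + GAMMA * sum(p * V[s2]
                                                         for s2, p in P[s][a].items()))
              for s in STATES}
    return V, policy


if __name__ == "__main__":
    V, policy = value_iteration()
    print(V)       # "at_depot" has lower value than "at_goal"
    print(policy)  # "at_depot" -> "move"; at the goal both actions are equally good
```

Running the sketch prints the converged values and the resulting policy; the reinforcement-learning algorithms discussed in the text estimate the same kind of values and policies from interaction with the environment rather than from a known model.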

[1] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[2] Chris Watkins. Learning from delayed rewards, 1989.

[3] Sridhar Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 2004.

[4] Gavin Adrian Rummery. Problem solving with reinforcement learning, 1995.

[5] K. Främling. Bi-Memory Model for Guiding Exploration by Pre-existing Knowledge, 2005.

[6] Martijn van Otterlo. The logic of adaptive behavior: knowledge representation and algorithms for the Markov decision process framework in first-order domains, 2008.

[7] Bohdana Ratitch. On characteristics of Markov decision processes and reinforcement learning in large domains, 2005.

[8] Marcus A. Maloof et al. Incremental rule learning with partial instance memory for changing concepts. Proceedings of the International Joint Conference on Neural Networks, 2003.

[9] Leslie Pack Kaelbling et al. On the Complexity of Solving Markov Decision Problems. UAI, 1995.

[10] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[11] Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore. Reinforcement Learning: A Survey. J. Artif. Intell. Res., 1996.

[12] John N. Tsitsiklis et al. Actor-Critic Algorithms. NIPS, 1999.

[13] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[14] W. Matthews. Mazes and Labyrinths: A General Account of Their History and Developments. Nature, 2015.

[15] Richard S. Sutton et al. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.

[16] Richard S. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. NIPS, 1996.

[17] Adaptive State-Space Quantisation and Multi-Task Reinforcement Learning Using ..., 2000.

[18] Andrew W. Moore et al. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time. Machine Learning, 1993.

[19] Craig Boutilier et al. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage. J. Artif. Intell. Res., 1999.

[20] Nicholas Kushmerick et al. An Algorithm for Probabilistic Planning. Artif. Intell., 1995.

[21] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[22] Ian H. Witten. An Adaptive Optimal Controller for Discrete-Time Markov Environments. Inf. Control, 1977.

[23] Sven Koenig et al. The interaction of representations and planning objectives for decision-theoretic planning tasks. J. Exp. Theor. Artif. Intell., 2002.

[24] Anton Schwartz. A Reinforcement Learning Method for Maximizing Undiscounted Rewards. ICML, 1993.

[25] Marco Wiering. Explorations in efficient reinforcement learning, 1999.

[26] Craig Boutilier et al. Knowledge Representation for Stochastic Decision Processes. Artificial Intelligence Today, 1999.

[27] Sean R. Eddy. What is dynamic programming? Nature Biotechnology, 2004.

[28] Jonathan Schaeffer et al. Kasparov versus Deep Blue: The Rematch. J. Int. Comput. Games Assoc., 1997.

[29] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[30] Wayne L. Winston. Operations Research: Applications and Algorithms, 2004.

[31] Mahesan Niranjan et al. On-line Q-learning using connectionist systems, 1994.

[32] Stuart I. Reynolds. Reinforcement Learning with Exploration, 2002.

[33] Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 1990.

[34] Ronald A. Howard. Dynamic Programming and Markov Processes, 1960.

[35] Thomas G. Dietterich. What is machine learning? Archives of Disease in Childhood, 2020.