Using Control Theory for Analysis of Reinforcement Learning and Optimal Policy Properties in Grid-World Problems

Markov Decision Processes (MDPs) have numerous applications in science, engineering, economics, and management. Many decision processes exhibit the Markov property and can therefore be modeled as MDPs. Reinforcement Learning (RL) is an approach for solving MDPs, and RL methods build on Dynamic Programming (DP) algorithms such as Policy Evaluation, Policy Iteration, and Value Iteration. In this paper, the policy evaluation algorithm is represented as a discrete-time dynamical system, so that discrete-time control methods can be used to analyze the behavior of the agent and the properties of different policies. The general case of grid-world problems is addressed, and several important results for this class of problems are established as a theorem; for example, the equivalent system of an optimal policy for a grid-world problem is dead-beat.
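
To make the dynamical-system view concrete, the sketch below (an illustration under assumed details, not code or notation taken from the paper) writes policy evaluation for a fixed policy as the linear discrete-time system V_{k+1} = R_pi + gamma * P_pi * V_k on a small one-dimensional grid world; the grid size, the reward of -1 per move, and the "always move right" policy are assumptions made for the example. Because such an optimal policy is deterministic and loop-free, P_pi restricted to the non-terminal states is nilpotent, all poles of the equivalent system sit at the origin, and the iteration settles in finitely many steps, which is the dead-beat behavior referred to above.

    import numpy as np

    # Minimal sketch (assumed setup): policy evaluation for a fixed policy pi,
    # written as the linear discrete-time system
    #     V_{k+1} = R_pi + gamma * P_pi @ V_k,
    # where P_pi and R_pi are the transition matrix and reward vector induced by pi.

    gamma = 1.0   # undiscounted, as is common for shortest-path grid worlds
    n = 4         # non-terminal cells 0..3 of a 1-D grid; cell 4 is terminal

    # Assumed optimal policy "always move right": deterministic and loop-free,
    # so P_pi over the non-terminal states is the nilpotent shift matrix below.
    P_pi = np.zeros((n, n))
    for s in range(n - 1):
        P_pi[s, s + 1] = 1.0   # from cell s the agent steps into cell s + 1
    R_pi = -np.ones(n)         # cost of -1 per move until the terminal cell

    V = np.zeros(n)            # initial state of the equivalent dynamical system
    for k in range(1, 10):
        V_next = R_pi + gamma * P_pi @ V
        if np.allclose(V_next, V):
            print(f"fixed point reached after {k - 1} updates: V = {V}")
            break
        V = V_next

    # Because P_pi is nilpotent, (gamma * P_pi)**n == 0, so the error is wiped
    # out in at most n steps: finite settling time, i.e. dead-beat behavior,
    # with every eigenvalue (pole) of the equivalent system at the origin.
    print("eigenvalues of gamma * P_pi:", np.linalg.eigvals(gamma * P_pi))

Running the sketch reports the exact values (-4, -3, -2, -1) after four updates, matching the number of non-terminal cells, and shows that all eigenvalues of gamma * P_pi are zero, which is the discrete-time characterization of a dead-beat system.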
