Reachability and Differential based Heuristics for Solving Markov Decision Processes

The solution convergence of Markov Decision Processes (MDPs) can be accelerated by prioritized sweeping of states ranked by their potential impact on other states. In this paper, we present new heuristics to speed up the solution convergence of MDPs. First, we quantify the reachability of every state using the Mean First Passage Time (MFPT) and show that this reachability characterization is a good measure of state importance, which we exploit for effective state prioritization. Then, we introduce the notion of backup differentials as an extension of the prioritized sweeping mechanism, in order to evaluate the impact of states at an even finer scale. Finally, we extend state prioritization to the temporal process, where only partial sweeping can be performed during certain intermediate value iteration stages. To validate our design, we performed numerical evaluations comparing the proposed heuristics with corresponding classic baseline mechanisms. The results show that our reachability-based framework and its differential variants outperform state-of-the-art solutions in terms of both practical runtime and number of iterations.
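Below is a minimal sketch of how MFPT-based reachability could drive the sweep order of a Gauss-Seidel-style value iteration. It assumes a per-state reward vector R, per-action transition matrices P_a, and an action-averaged reference chain for computing MFPT; these names and choices are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mean_first_passage_times(P, goal):
    """MFPT from every state to `goal` for a Markov chain with row-stochastic
    transition matrix P: solve (I - Q) h = 1, where Q drops the goal state."""
    n = P.shape[0]
    idx = [s for s in range(n) if s != goal]
    A = np.eye(n - 1) - P[np.ix_(idx, idx)]
    h = np.linalg.solve(A, np.ones(n - 1))
    mfpt = np.zeros(n)
    mfpt[idx] = h          # mfpt[goal] stays 0
    return mfpt

def mfpt_prioritized_value_iteration(P_a, R, goal, gamma=0.95, eps=1e-6):
    """Value iteration that sweeps states in increasing MFPT order, so values
    propagate outward from the goal within each sweep (in-place updates).
    P_a: dict mapping action -> (n x n) transition matrix; R: (n,) rewards."""
    n = R.shape[0]
    # Reachability characterization under a reference chain (here the
    # action-averaged chain -- an illustrative choice, not the paper's).
    P_ref = sum(P_a.values()) / len(P_a)
    order = np.argsort(mean_first_passage_times(P_ref, goal))
    V = np.zeros(n)
    while True:
        delta = 0.0
        for s in order:                        # prioritized (ordered) sweep
            best = max(R[s] + gamma * Pa[s] @ V for Pa in P_a.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V
```

Because the updates are applied in place and states close to the goal (small MFPT) are swept first, each pass pushes value information from the goal toward less reachable states, which is the intuition behind using reachability as a prioritization signal.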
