Generalizing Dijkstra's Algorithm and Gaussian Elimination for Solving MDPs

Abstract: The authors study the problem of computing the optimal value function for a Markov Decision Process (MDP) with positive costs. Computing this function quickly and accurately is a basic step in many schemes for deciding how to act in stochastic environments. There are efficient algorithms that compute value functions for special types of MDPs. For deterministic MDPs with S states and A actions, Dijkstra's algorithm runs in time O(AS log S). And, in single-action MDPs (Markov chains), standard linear-algebraic algorithms find the value function in time O(S^3), or faster by taking advantage of sparsity or good conditioning. Algorithms for solving general MDPs can take much longer: the authors are not aware of any speed guarantees better than those for comparably sized linear programs. They present a family of algorithms that reduce to Dijkstra's algorithm when applied to deterministic MDPs, and to standard techniques for solving linear equations when applied to Markov chains. More importantly, they demonstrate experimentally that these algorithms perform well when applied to MDPs that "almost" have the required special structure.
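The sketch below illustrates the two special cases named in the abstract, not the authors' generalized algorithm: a Dijkstra-style sweep recovers the value function of a deterministic positive-cost MDP, and a standard linear solve recovers it for a single-action MDP (Markov chain). The function names, the edge representation, and the use of a discount factor in the chain case are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the two special cases from the abstract (assumed
# representations; not the paper's generalized algorithm).

import heapq
import numpy as np


def deterministic_mdp_values(num_states, edges, goals):
    """Cost-to-go of a deterministic MDP via multi-source Dijkstra.

    edges: list of (state, cost, next_state) triples with cost > 0.
    goals: states with value 0.  Runs in O(AS log S) with a binary heap.
    """
    # Reverse the edges so we can sweep outward from the goal states.
    reverse = [[] for _ in range(num_states)]
    for s, cost, t in edges:
        reverse[t].append((cost, s))

    value = [float("inf")] * num_states
    heap = []
    for g in goals:
        value[g] = 0.0
        heapq.heappush(heap, (0.0, g))

    while heap:
        v, s = heapq.heappop(heap)
        if v > value[s]:
            continue  # stale queue entry
        for cost, pred in reverse[s]:
            if v + cost < value[pred]:
                value[pred] = v + cost
                heapq.heappush(heap, (value[pred], pred))
    return value


def markov_chain_values(P, c, gamma=0.95):
    """Value function of a single-action MDP: solve (I - gamma * P) V = c.

    A discount factor is assumed here so the linear system is nonsingular;
    an absorbing-goal chain with positive costs works the same way.
    """
    n = len(c)
    return np.linalg.solve(np.eye(n) - gamma * np.asarray(P), np.asarray(c))


if __name__ == "__main__":
    # Tiny 3-state deterministic MDP: state 2 is the goal.
    edges = [(0, 1.0, 1), (0, 4.0, 2), (1, 1.0, 2)]
    print(deterministic_mdp_values(3, edges, goals=[2]))  # [2.0, 1.0, 0.0]

    # Tiny 2-state Markov chain with per-state costs.
    P = [[0.5, 0.5], [0.1, 0.9]]
    c = [1.0, 2.0]
    print(markov_chain_values(P, c))
```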
