Computing and Using Lower and Upper Bounds for Action Elimination in MDP Planning

We describe a way to improve the performance of MDP planners by modifying them to use lower and upper bounds to eliminate non-optimal actions during their search. First, we discuss a particular state-abstraction formulation of MDP planning problems and show how to use that formulation to compute bounds on the Q-functions of those problems. Then we describe how to incorporate those bounds into a large class of MDP planning algorithms to control their search during planning. We provide theorems establishing the correctness of this technique and an experimental evaluation demonstrating its effectiveness. We incorporated our ideas into two MDP planners: the Real Time Dynamic Programming (RTDP) algorithm [1] and the Adaptive Multistage (AMS) sampling algorithm [2], taken respectively from the automated planning and operations research communities. Our experiments on an Unmanned Aerial Vehicle (UAV) path-planning problem demonstrate that our action-elimination technique provides significant speed-ups for both RTDP and AMS.
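The pruning rule at the heart of bound-based action elimination can be illustrated with a short sketch (a minimal illustration of the general idea, not the paper's implementation): in a reward-maximization setting, an action a can be discarded in state s whenever its upper bound Q_U(s, a) falls below the best lower bound max over a' of Q_L(s, a'), since such an action can never be optimal as long as the bounds are valid. The names below (eliminate_actions, q_lower, q_upper) are hypothetical, introduced only for this sketch.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable


def eliminate_actions(
    state: State,
    actions: List[Action],
    q_lower: Dict[Tuple[State, Action], float],
    q_upper: Dict[Tuple[State, Action], float],
) -> List[Action]:
    """Keep only the actions that could still be optimal in `state`.

    Assumes Q_L(s, a) <= Q*(s, a) <= Q_U(s, a) for every state-action
    pair and a reward-maximization objective.
    """
    # The best value we are guaranteed to achieve from this state.
    best_lower = max(q_lower[(state, a)] for a in actions)
    # An action whose optimistic value cannot reach that guarantee is
    # provably non-optimal and can be pruned from the search.
    return [a for a in actions if q_upper[(state, a)] >= best_lower]


# Example: action "b" is eliminated because even its upper bound (4.0)
# falls below the best lower bound (5.0) provided by action "a".
if __name__ == "__main__":
    ql = {("s0", "a"): 5.0, ("s0", "b"): 2.0}
    qu = {("s0", "a"): 8.0, ("s0", "b"): 4.0}
    print(eliminate_actions("s0", ["a", "b"], ql, qu))  # -> ['a']
```

A planner such as RTDP or AMS would apply a rule of this kind at each visited state, so that sampling and Bellman backups are restricted to the surviving actions.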

[1] Robert Givan, et al. Model Reduction Techniques for Computing Approximately Optimal Solutions for Markov Decision Processes, 1997, UAI.

[2] Richard E. Korf, et al. Optimal Path-Finding Algorithms, 1988.

[3] T. Dean, et al. Planning under Uncertainty: Structural Assumptions and Computational Leverage, 1996.

[4] Craig Boutilier, et al. Abstraction and Approximate Decision-Theoretic Planning, 1997, Artif. Intell.

[5] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[6] Raffaello D'Andrea, et al. Path Planning for Unmanned Aerial Vehicles in Uncertain and Adversarial Environments, 2003.

[7] Fausto Giunchiglia, et al. A Theory of Abstraction, 1992, Artif. Intell.

[8] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[9] Thomas J. Walsh, et al. Towards a Unified Theory of State Abstraction for MDPs, 2006, AI&M.

[10] J. MacQueen. A Modified Dynamic Programming Method for Markovian Decision Problems, 1966.

[11] Jonathan Schaeffer, et al. Efficiently Searching the 15-Puzzle, 1994.

[12] Richard E. Korf, et al. Finding Optimal Solutions to Rubik's Cube Using Pattern Databases, 1997, AAAI/IAAI.

[13] Leslie Pack Kaelbling, et al. Planning under Time Constraints in Stochastic Domains, 1993, Artif. Intell.

[14] Stanley E. Zin, et al. Spline Approximations to Value Functions: Linear Programming Approach, 1997.

[15] David Kelley. A Theory of Abstraction, 1984.

[16] Craig Boutilier, et al. Stochastic Dynamic Programming with Factored Representations, 2000, Artif. Intell.

[17] Shie Mannor, et al. Action Elimination and Stopping Conditions for Reinforcement Learning, 2003, ICML.

[18] Thomas G. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, 1999, J. Artif. Intell. Res.

[19] Panos M. Pardalos, et al. Cooperative Control: Models, Applications, and Algorithms, 2003.

[20] Blai Bonet, et al. Labeled RTDP: Improving the Convergence of Real-Time Dynamic Programming, 2003, ICAPS.

[21] Craig Boutilier, et al. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage, 1999, J. Artif. Intell. Res.

[22] John N. Tsitsiklis, et al. Feature-Based Methods for Large Scale Dynamic Programming, 2004, Machine Learning.

[23] Balaraman Ravindran, et al. SMDP Homomorphisms: An Algebraic Approach to Abstraction in Semi-Markov Decision Processes, 2003, IJCAI.

[24] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[25] Alfredo Milani, et al. New Directions in AI Planning, 1996.

[26] Chris Watkins. Learning from Delayed Rewards, 1989.

[27] Drew McDermott, et al. Modeling a Dynamic and Uncertain World I: Symbolic and Probabilistic Reasoning About Change, 1994, Artif. Intell.

[28] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[29] D. Bertsekas, et al. Adaptive Aggregation Methods for Infinite Horizon Dynamic Programming, 1989.

[30] Michael C. Fu, et al. An Adaptive Sampling Algorithm for Solving Markov Decision Processes, 2005, Oper. Res.

[31] Shlomo Zilberstein, et al. LAO*: A Heuristic Search Algorithm that Finds Solutions with Loops, 2001, Artif. Intell.

[32] Robert Givan, et al. Equivalence Notions and Model Minimization in Markov Decision Processes, 2003, Artif. Intell.

[33] Benjamin Van Roy, et al. The Linear Programming Approach to Approximate Dynamic Programming, 2003, Oper. Res.