Labeled RTDP: Improving the Convergence of Real-Time Dynamic Programming

RTDP is a recent heuristic-search DP algorithm for solving non-deterministic planning problems with full observability. In relation to other dynamic programming methods, RTDP has two benefits: first, it does not have to evaluate the entire state space in order to deliver an optimal policy, and second, it can often deliver good policies quickly. On the other hand, RTDP's final convergence is slow. In this paper we introduce a labeling scheme into RTDP that speeds up its convergence while retaining its good anytime behavior. The idea is to label a state s as solved when the heuristic values, and hence the greedy policy defined by them, have converged over s and over the states reachable from s with the greedy policy. Although, due to the presence of cycles, these labels cannot in general be computed in a recursive, bottom-up fashion, we show that they can nonetheless be computed quite fast, and that the overhead is compensated by the recomputations avoided. In addition, when the labeling procedure cannot label a state as solved, it improves the heuristic value of a relevant state. As a result, the number of Labeled RTDP trials needed for convergence, unlike the number of RTDP trials, is bounded. From a practical point of view, Labeled RTDP (LRTDP) converges orders of magnitude faster than RTDP, and also faster than another recent heuristic-search DP algorithm, LAO*. Moreover, LRTDP often converges faster than value iteration, even with the heuristic h = 0, suggesting that LRTDP has quite a general scope.
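
The abstract describes the labeling scheme only informally. The following is a minimal, hypothetical Python sketch of that idea: a CHECK-SOLVED-style procedure that tries to label a state and everything its greedy policy reaches, plus a labeled trial that stops at solved states. The SSP interface used here (mdp.actions, mdp.transitions, mdp.cost, mdp.is_goal, heuristic mdp.h) is an assumption for illustration, not the paper's own pseudocode or API.

import random

# Hypothetical SSP interface assumed throughout:
#   mdp.actions(s)        -> iterable of actions applicable in s
#   mdp.transitions(s, a) -> iterable of (successor, probability) pairs
#   mdp.cost(s, a)        -> non-negative cost
#   mdp.is_goal(s)        -> True for (zero-cost, absorbing) goal states
#   mdp.h(s)              -> admissible heuristic, used as initial value (h(goal) = 0)

def q_value(mdp, V, s, a):
    return mdp.cost(s, a) + sum(p * V.get(sp, mdp.h(sp))
                                for sp, p in mdp.transitions(s, a))

def greedy_action(mdp, V, s):
    return min(mdp.actions(s), key=lambda a: q_value(mdp, V, s, a))

def residual(mdp, V, s):
    return abs(V.get(s, mdp.h(s)) - q_value(mdp, V, s, greedy_action(mdp, V, s)))

def check_solved(mdp, V, solved, s, eps):
    """Try to label s, and the states its greedy policy reaches, as solved."""
    rv, open_stack, closed, seen = True, [], [], {s}
    if s not in solved:
        open_stack.append(s)
    while open_stack:
        s = open_stack.pop()
        closed.append(s)
        if mdp.is_goal(s):                      # goals are trivially converged
            continue
        if residual(mdp, V, s) > eps:           # value not yet converged here
            rv = False
            continue
        a = greedy_action(mdp, V, s)            # expand greedy successors only
        for sp, p in mdp.transitions(s, a):
            if p > 0 and sp not in solved and sp not in seen:
                seen.add(sp)
                open_stack.append(sp)
    if rv:
        solved.update(closed)                   # label every visited state
    else:
        while closed:                           # otherwise improve their values
            s = closed.pop()
            if not mdp.is_goal(s):
                V[s] = q_value(mdp, V, s, greedy_action(mdp, V, s))
    return rv

def lrtdp_trial(mdp, V, solved, s0, eps):
    """One labeled trial: greedy simulation until a solved/goal state, then labeling."""
    visited, s = [], s0
    while s not in solved and not mdp.is_goal(s):
        visited.append(s)
        a = greedy_action(mdp, V, s)
        V[s] = q_value(mdp, V, s, a)            # Bellman update on the way down
        succ, probs = zip(*mdp.transitions(s, a))
        s = random.choices(succ, weights=probs)[0]
    while visited:                              # try to label states bottom-up
        if not check_solved(mdp, V, solved, visited.pop(), eps):
            break

def lrtdp(mdp, s0, eps=1e-3):
    V, solved = {}, set()
    while s0 not in solved:                     # trials stop once s0 is labeled solved
        lrtdp_trial(mdp, V, solved, s0, eps)
    return V

In this sketch, each failed check_solved call performs a Bellman update on at least one state whose residual exceeds eps, which is what bounds the number of trials needed for convergence, in contrast with plain RTDP.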
