Trial-Based Heuristic Tree Search for Finite Horizon MDPs

Dynamic programming is a well-known approach for solving Markov decision processes (MDPs). In large state spaces, asynchronous variants such as Real-Time Dynamic Programming have been applied successfully. When the MDP is unfolded into an equivalent tree, Monte-Carlo Tree Search algorithms are a valid alternative. UCT, the most popular representative, obtains good anytime behavior by guiding the search towards promising areas of the search tree. The heuristic search algorithm AO* finds optimal solutions for MDPs that can be represented as acyclic AND/OR graphs. We introduce a common framework, Trial-based Heuristic Tree Search (THTS), that subsumes these approaches and distinguishes them by five ingredients: heuristic function, backup function, action selection, outcome selection, and trial length. Using this framework, we describe three new algorithms that mix these ingredients in novel ways to combine their different strengths. Our evaluation shows that two of these algorithms not only have stronger theoretical properties than UCT, but also outperform state-of-the-art approaches experimentally.
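For concreteness, the action selection ingredient that gives UCT its exploration behavior is the UCB1 rule of Auer et al. (2002): at a decision node for state s, it selects

    a^{\ast} = \operatorname{argmax}_{a} \left( \hat{Q}(s,a) + C \sqrt{\frac{\ln N(s)}{N(s,a)}} \right)

where \hat{Q}(s,a) is the current action-value estimate, N(s) and N(s,a) are the visit counts of the node and of action a in it, and C is an exploration constant trading off exploitation against exploration.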

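To make the schema concrete, below is a minimal Python sketch of a generic trial loop with the five ingredients passed in as functions. It assumes a hypothetical mdp object exposing actions(state) and outcomes(state, action), the latter returning (successor, probability, reward) triples; it uses a UCT-style trial length (a trial ends when it expands a previously unvisited node) and plugs in UCB1 action selection, Monte-Carlo outcome selection, and full Bellman backups as example ingredients. All names are illustrative, not the paper's pseudocode.

    import math
    import random
    import time

    # Illustrative sketch of the THTS schema; the mdp interface used below
    # (mdp.actions, mdp.outcomes) is an assumption, not from the paper.

    class Node:
        """Search tree node for one (state, depth) pair."""
        def __init__(self, state, depth):
            self.state = state
            self.depth = depth
            self.visits = 0
            self.value = 0.0
            # action -> list of (child Node, transition probability, reward)
            self.children = {}

    def ucb1_select_action(node, c=1.4):
        # Action selection ingredient (as in UCT): UCB1 over action children.
        def score(action):
            kids = node.children[action]
            n_a = sum(ch.visits for ch, _, _ in kids)
            if n_a == 0:
                return float("inf")  # try every action at least once
            q = sum(p * (r + ch.value) for ch, p, r in kids)
            # node.visits >= 1 whenever the node already has children
            return q + c * math.sqrt(math.log(node.visits) / n_a)
        return max(node.children, key=score)

    def mc_select_outcome(node, action):
        # Outcome selection ingredient: sample successors by probability.
        kids = node.children[action]
        return random.choices([ch for ch, _, _ in kids],
                              weights=[p for _, p, _ in kids], k=1)[0]

    def bellman_backup(node):
        # Backup ingredient: full Bellman backup over all action children.
        if node.children:
            node.value = max(
                sum(p * (r + ch.value) for ch, p, r in node.children[a])
                for a in node.children
            )

    def thts(mdp, root_state, horizon, heuristic,
             select_action=ucb1_select_action,
             select_outcome=mc_select_outcome,
             backup=bellman_backup, seconds=1.0):
        """Generic THTS loop; trials run until the time budget is spent."""
        root = Node(root_state, 0)
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            node, trial = root, [root]
            # Descend until an unexpanded node or the horizon is reached
            # (the trial length ingredient, here chosen as in UCT).
            while node.children and node.depth < horizon:
                node = select_outcome(node, select_action(node))
                trial.append(node)
            if node.depth < horizon:
                # Expansion: create action children, initialized with the
                # heuristic function ingredient.
                for a in mdp.actions(node.state):
                    node.children[a] = [
                        (Node(s, node.depth + 1), p, r)
                        for (s, p, r) in mdp.outcomes(node.state, a)
                    ]
                    for ch, _, _ in node.children[a]:
                        ch.value = heuristic(mdp, ch.state, horizon - ch.depth)
            # Propagate new information from the trial's tip to the root.
            for n in reversed(trial):
                backup(n)
                n.visits += 1
        # Recommend the greedy action at the root.
        return max(root.children,
                   key=lambda a: sum(p * (r + ch.value)
                                     for ch, p, r in root.children[a]))

Swapping the ingredient functions recovers different points in the design space the abstract describes: Monte-Carlo backups with sampled outcomes move the sketch towards plain UCT, while admissible heuristic initialization with full Bellman backups moves it towards AO*-flavoured variants.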