Action Selection for MDPs: Anytime AO* Versus UCT

In the presence of non-admissible heuristics, A* and other best-first algorithms can be converted into anytime optimal algorithms over OR graphs by simply continuing the search after the first solution is found. The same trick, however, does not work for best-first algorithms over AND/OR graphs, which must be able to expand leaf nodes of the explicit graph that are not necessarily part of the best partial solution. Anytime optimal variants of AO* must thus address an exploration-exploitation tradeoff: they cannot just "exploit"; they must keep exploring as well. In this work, we develop one such variant of AO* and apply it to finite-horizon MDPs. This Anytime AO* algorithm eventually delivers an optimal policy while using non-admissible random heuristics that can be sampled, as when the heuristic is the cost of a base policy that can be sampled with rollouts. We then test Anytime AO* for action selection over large infinite-horizon MDPs that cannot be solved with existing off-line heuristic search and dynamic programming algorithms, and compare it with UCT.
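To make the exploration-exploitation point concrete, the following is a minimal Python sketch of the kind of search the abstract describes, not the authors' implementation: a best-first search over an AND/OR graph of (state, depth) nodes for a toy finite-horizon chain MDP, where unexpanded tips are valued by rollouts of a random base policy (a sampled, non-admissible heuristic) and, with some probability, a tip outside the current best partial policy is expanded. The toy MDP, the function names (`anytime_ao_star`, `rollout_heuristic`), the 0.5 exploration probability, and the full-graph backup are illustrative assumptions.

```python
import random

# Toy finite-horizon MDP (illustrative only): states 0..4 on a chain, the
# rightmost state is an absorbing goal, each non-goal step costs 1, actions
# move left/right but slip with probability 0.2, horizon H.
N_STATES, H = 5, 6
ACTIONS = (-1, +1)

def transition(s, a):
    """Return a list of (probability, next_state) pairs; the goal is absorbing."""
    if s == N_STATES - 1:
        return [(1.0, s)]
    intended = min(max(s + a, 0), N_STATES - 1)
    slipped = min(max(s - a, 0), N_STATES - 1)
    return [(0.8, intended), (0.2, slipped)]

def cost(s, a):
    return 0.0 if s == N_STATES - 1 else 1.0

def rollout_heuristic(s, depth, n_rollouts=10):
    """Non-admissible sampled heuristic: average cost-to-go of a random base policy."""
    total = 0.0
    for _ in range(n_rollouts):
        st, acc = s, 0.0
        for _ in range(depth, H):
            a = random.choice(ACTIONS)
            acc += cost(st, a)
            r, cum = random.random(), 0.0
            for p, ns in transition(st, a):
                cum += p
                if r <= cum:
                    st = ns
                    break
        total += acc
    return total / n_rollouts

def anytime_ao_star(root_state=0, iterations=200, explore_prob=0.5):
    """AO*-style anytime search over (state, depth) nodes. Unexpanded tips are
    valued by the rollout heuristic; with probability explore_prob a tip outside
    the current best partial policy is expanded, otherwise a tip inside it."""
    root = (root_state, 0)
    V = {root: rollout_heuristic(*root)}   # value estimates for all known nodes
    children = {}                          # node -> {action: [(prob, child_node)]}
    best_action = {}                       # best known action at expanded nodes

    def expand(node):
        s, d = node
        children[node] = {}
        for a in ACTIONS:
            succ = [(p, (ns, d + 1)) for p, ns in transition(s, a)]
            children[node][a] = succ
            for _, child in succ:
                if child not in V:
                    cs, cd = child
                    V[child] = 0.0 if cd >= H else rollout_heuristic(cs, cd)

    def backup_all():
        # The explicit graph is layered by depth, so updating deepest nodes
        # first propagates new information up to the root.
        for node in sorted(children, key=lambda n: -n[1]):
            s, _ = node
            qs = {a: cost(s, a) + sum(p * V[c] for p, c in succ)
                  for a, succ in children[node].items()}
            best_action[node] = min(qs, key=qs.get)
            V[node] = qs[best_action[node]]

    def tips(best_only):
        # Unexpanded nodes reachable from the root, either through the whole
        # explicit graph or only along the current best partial policy.
        found, stack, seen = [], [root], set()
        while stack:
            node = stack.pop()
            if node in seen or node[1] >= H:
                continue
            seen.add(node)
            if node not in children:
                found.append(node)
                continue
            acts = [best_action[node]] if best_only else list(children[node])
            for a in acts:
                stack.extend(c for _, c in children[node][a])
        return found

    for _ in range(iterations):
        exploit = random.random() >= explore_prob
        frontier = tips(exploit) or tips(False)   # keep exploring once the best
        if not frontier:                          # partial policy is closed
            break
        expand(random.choice(frontier))
        backup_all()

    return best_action.get(root), V[root]
```

Calling `anytime_ao_star()` returns the action currently preferred at the root together with its value estimate; run long enough on this toy problem, the explicit graph covers every node reachable within the horizon and the returned action is optimal, mirroring the anytime-optimality behavior described above.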
