Trial-Based Heuristic Tree Search for MDPs with Factored Action Spaces

MDPs with factored action spaces, i.e., where actions are described as assignments to a set of action variables, allow reasoning over action variables instead of ground actions, yet most algorithms only consider a grounded action representation. This includes algorithms that are instantiations of the trial-based heuristic tree search (THTS) framework, such as AO* or UCT. To be able to reason over factored action spaces, we propose a generalisation of THTS where nodes that branch over all applicable actions are replaced with subtrees consisting of nodes that each represent the decision for a single action variable. We show that many THTS algorithms retain their theoretical properties under the generalised framework, and we show how to approximate any state-action heuristic as a heuristic for partial action assignments. This allows us to guide a UCT variant that is able to create exponentially fewer nodes than the same algorithm that considers ground actions. An empirical evaluation on the benchmark set of the probabilistic track of the latest International Planning Competition validates the benefits of the approach.
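
The node-count argument lends itself to a small illustration. The sketch below is a minimal Python rendering of the idea, not the authors' implementation: it builds a per-variable decision subtree lazily, one node per action variable along each trial, with UCB1 selection at every node. The binary domains, the exploration constant, the stand-in reward function, and the names Node, trial, and count_nodes are all illustrative assumptions. With 10 binary action variables, a ground decision node would branch over all 2^10 = 1024 joint assignments on its first expansion, whereas the factored subtree only materialises nodes along sampled paths.

```python
import math
import random

class Node:
    """Decision node that fixes action variable `depth`; children condition
    on the values chosen for earlier variables (a sketch of the paper's
    per-variable decision subtrees, not the PROST code)."""
    def __init__(self, depth):
        self.depth = depth
        self.children = {}              # value -> Node, created lazily by trials
        self.visits = 0
        self.value_visits = {0: 0, 1: 0}
        self.value_rewards = {0: 0.0, 1: 0.0}

    def select(self, c=1.4):
        # UCB1 over the values of a single action variable: try each value
        # once, then balance average reward against an exploration bonus.
        for v in (0, 1):
            if self.value_visits[v] == 0:
                return v
        return max((0, 1), key=lambda v:
                   self.value_rewards[v] / self.value_visits[v]
                   + c * math.sqrt(math.log(self.visits) / self.value_visits[v]))

def trial(root, n_vars, reward_fn):
    """One trial: walk down the subtree fixing one variable per node,
    expanding children lazily, then back up the sampled reward. The final
    node created is a leaf marking a complete assignment."""
    path, node, assignment = [], root, []
    for _ in range(n_vars):
        v = node.select()
        assignment.append(v)
        path.append((node, v))
        node = node.children.setdefault(v, Node(node.depth + 1))
    reward = reward_fn(assignment)
    for n, v in path:
        n.visits += 1
        n.value_visits[v] += 1
        n.value_rewards[v] += reward

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())

n_vars = 10
root = Node(0)
reward_fn = lambda a: sum(a) + random.gauss(0.0, 0.1)  # illustrative stand-in
for _ in range(50):
    trial(root, n_vars, reward_fn)
print(count_nodes(root))  # bounded by 50 * 10 + 1, far below 2**10 = 1024
```

Each node in the subtree aggregates statistics over every ground action consistent with its partial assignment, which is why a heuristic for partial action assignments, as approximated from a state-action heuristic in the paper, can steer the search away from whole regions of the joint action space before they are ever instantiated.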
