Active Tree Search

Monte-Carlo tree search is based on contiguous rollouts. Since not all samples within a rollout necessarily provide relevant information, contiguous rollouts may be wasteful compared to sampling selected transitions. In this paper, we describe an active learning approach that selects single transitions within the tree for sampling, with the goal of maximizing information gain. We show that this approach can enhance purely rollout-based MCTS by actively sampling single transitions in addition to performing contiguous rollouts. We demonstrate that our method outperforms classical MCTS in a prototypical domain and discuss the interplay of the active learning component with the classical rollout-based sampling strategy.

Introduction

Monte-Carlo tree search (MCTS) has become a standard planning method that has been successfully applied in various domains, ranging from computer Go to large-scale POMDPs (Silver et al. 2016; Browne et al. 2012). An appealing property of MCTS is that it only requires the ability to simulate transitions in the environment. Planning is then performed by simulating contiguous rollouts from the root node. When collecting new samples, an important concern is to improve the estimates of the transition and reward functions. For this concern, contiguous rollouts may be wasteful because not all samples along a rollout necessarily provide relevant information. For instance, transitions that are (close to) deterministic do not require as many samples for a good estimate as transitions with high stochasticity. We therefore suggest an alternative to rollout-based sampling by formulating an active learning measure (Settles 2009) that can be used to select single transitions for sampling anywhere in the tree. To make computation of the involved expectations tractable, we derive an efficient approximation based on reverse accumulation of the objective gradient through the tree. In a prototypical domain, we demonstrate that combining our active learning measure with rollout-based sampling outperforms classical MCTS. We also discuss the interplay between active samples and rollout-based samples, providing deeper insight into the different concerns to be addressed and giving directions for further research.

Our main contributions are as follows:

• We formulate an active learning measure for selecting single transitions to be sampled.
• We derive an efficient approximation of our measure for practical application.
• We provide an enhanced MCTS algorithm by combining our active learning measure with rollout-based samples.
• We empirically show that our enhanced method outperforms classical MCTS in a prototypical domain.
• We discuss the characteristics, possible shortcomings, and possible extensions of our method.

In the remainder of this paper, we first discuss related work on MCTS and active learning, then present our active learning measure and show how to approximate it efficiently, and finally present our empirical evaluations and discuss the characteristics of our method.

Related Work

Monte-Carlo Tree Search

Monte-Carlo tree search (MCTS, Browne et al. 2012) comes in a number of flavors that mainly differ in three respects (Keller and Helmert 2013): (1) the tree-policy that is used for selecting actions, (2) the value heuristic that is used for initializing leaf nodes, and (3) the backup method that is used for propagating information back to the root node.
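To make these three components concrete, the following is a minimal sketch of a conventional rollout-based (UCT-style) MCTS loop. It is not the method proposed in this paper, only an illustration of the baseline scheme; all names (Simulator, Node, ucb1, rollout, simulate) and the toy dynamics are illustrative assumptions rather than definitions taken from the paper.

```python
# Minimal UCT-style MCTS sketch illustrating the three components above:
# (1) a tree-policy (here UCB1), (2) a value heuristic for initializing new
# leaf nodes (here a random rollout), and (3) a backup rule (here a running
# Monte-Carlo average of observed returns).
import math
import random


class Simulator:
    """Black-box generative model: maps (state, action) to (next_state, reward)."""

    def __init__(self, n_states=5, n_actions=2, horizon=10, seed=0):
        self.n_states, self.n_actions, self.horizon = n_states, n_actions, horizon
        self.rng = random.Random(seed)

    def step(self, state, action):
        next_state = self.rng.randrange(self.n_states)      # toy stochastic dynamics
        reward = 1.0 if (state + action) % 2 == 0 else 0.0  # toy reward
        return next_state, reward


class Node:
    def __init__(self, state, depth):
        self.state, self.depth = state, depth
        self.children = {}  # (action, next_state) -> Node
        self.visits = {}    # action -> visit count
        self.values = {}    # action -> mean return estimate


def ucb1(node, action, c=1.4):
    """Tree-policy score: mean value plus exploration bonus."""
    total = sum(node.visits.values())
    n = node.visits.get(action, 0)
    if n == 0:
        return float("inf")
    return node.values[action] + c * math.sqrt(math.log(total) / n)


def rollout(sim, state, depth):
    """Value heuristic: estimate a leaf's value with a uniformly random rollout."""
    ret = 0.0
    for _ in range(depth, sim.horizon):
        state, r = sim.step(state, sim.rng.randrange(sim.n_actions))
        ret += r
    return ret


def simulate(sim, node):
    """One contiguous rollout from this node, expanding one leaf on the way."""
    if node.depth >= sim.horizon:
        return 0.0
    action = max(range(sim.n_actions), key=lambda a: ucb1(node, a))
    next_state, reward = sim.step(node.state, action)
    key = (action, next_state)
    if key not in node.children:
        child = Node(next_state, node.depth + 1)
        node.children[key] = child
        ret = reward + rollout(sim, next_state, child.depth)
    else:
        ret = reward + simulate(sim, node.children[key])
    # Backup: update the running mean return for the chosen action.
    n = node.visits.get(action, 0) + 1
    old = node.values.get(action, 0.0)
    node.visits[action] = n
    node.values[action] = old + (ret - old) / n
    return ret


if __name__ == "__main__":
    sim = Simulator()
    root = Node(state=0, depth=0)
    for _ in range(500):
        simulate(sim, root)
    best = max(root.values, key=root.values.get)
    print("root action-value estimates:", root.values, "-> best action:", best)
```

In this sketch the tree-policy is UCB1, the value heuristic is a random rollout, and the backup is a running Monte-Carlo average; the MCTS variants referenced above differ precisely in how these three pieces are instantiated.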
In this work we focus on sampling transitions within the tree, which is the task of the tree-policy in conventional MCTS. The tree-policy has to balance exploration and exploitation. That is, it has to choose actions that help improve the estimates of the action values (exploration), but it also has to choose actions with a high value in order to focus subsequent sampling and expansion of the tree on relevant regions of the state space (exploitation). These two concerns are somewhat conflicting, and the attempt to address them separately has led to alternative scheduling schemes for the tree-policy (Feldman and Domshlak 2012). However, to our knowledge, there is no work on departing from a rollout-based scheme for sampling transitions, which is exactly what we suggest in this paper. While we still use a tree-policy to select leaf nodes for expansion, we additionally sample single transitions within the tree with the goal of maximizing information gain. In a prototypical domain, we demonstrate that this combined method outperforms purely rollout-based MCTS.

Active Learning

The goal of active learning (Settles 2009) is generally to select samples optimally for learning a property of interest. For MCTS, the multi-armed bandit problem (MAB, Berry and Fristedt 1985) is of particular interest, as most research on improving the tree-policy is based on MABs. To transfer results from MABs to MCTS, action selection in each decision node is treated as a separate MAB with a non-stationary reward distribution. The overall problem of sampling transitions within the tree is thus split up into a series of simpler problems. The approach we suggest in this paper is different in that we do not break down the problem into a series of MABs but instead formulate the problem of choosing a new transition to be sampled anywhere in the tree as a single active learning problem. A common objective for active learning, especially when formulated in the framework of optimal experimental design (Chaloner and Verdinelli 1995), is to minimize the uncertainty of the distribution of interest as measured by the entropy or the variance. Our objective in this work is to minimize the state-value variance at the root node by sampling transitions that maximize its expected change. A major contribution of this paper is an efficient approximation of this objective by propagating its gradient through the tree.

Active Tree Search

We will first formally state the problem of sampling-based planning and then present our active learning approach to solve it. In sampling-based planning, the planner can repeatedly use a black-box simulator for sampling transitions, that is, query state-action pairs (s, a) ∈ S × A and observe the resulting state and reward (s′, ρ) ∈ S × R in response. The value Q of action a in state s under policy π is defined as