Exploiting Variance Information in Monte-Carlo Tree Search

Abstract

In bandit problems as well as in Monte-Carlo tree search (MCTS), variance-based policies such as UCB-V are reported to show better performance in practice than policies that ignore variance information, such as UCB1. For bandits, UCB-V was proved to exhibit somewhat better convergence properties than UCB1. In contrast, for MCTS no convergence guarantees have so far been established for UCB-V. Our first contribution is to show that UCB-V provides the same convergence guarantees in MCTS that are known for UCB1. Another open problem with variance-based policies in MCTS is that they can only be used in conjunction with Monte-Carlo backups but not with the recently suggested and increasingly popular dynamic programming (DP) backups. This is because standard DP backups do not propagate variance information. Our second contribution is to derive update equations for the variance in DP backups, which significantly extends the applicability of variance-based policies in MCTS. Finally, we provide an empirical analysis of UCB-V and UCB1 in two prototypical environments showing that UCB-V significantly outperforms UCB1 both with Monte-Carlo and with dynamic programming backups.

Introduction

Monte-Carlo tree search (MCTS) has become a standard planning method and has been successfully applied in various domains, ranging from computer Go to large-scale POMDPs (Silver et al. 2016; Browne et al. 2012). Some of the most appealing properties of MCTS are that it is easy to implement, does not require a full probabilistic model of the environment but only the ability to simulate state transitions, is suited for large-scale environments, and provides theoretical convergence guarantees.

The core idea in MCTS is to treat a sequential decision problem as a series of bandit problems (Berry and Fristedt 1985). The main difference, however, is that in bandit problems the return distributions are assumed to be stationary, whereas in MCTS they are not because they vary with the tree-policy. This means that convergence properties do not necessarily carry over from the bandit setting to MCTS.

The most popular MCTS algorithm is UCT (Kocsis and Szepesvári 2006), which uses UCB1 (Auer, Cesa-Bianchi, and Fischer 2002) as tree-policy. UCB1 has proven bounds for the expected regret in the bandit setting as well as polynomial convergence guarantees for the failure probability in the MCTS setting. More recently, Audibert, Munos, and Szepesvári (2009) suggested UCB-V, which takes the empirical variance of the returns into account, and proved bounds for its expected regret in the bandit setting. In the case of MCTS, however, no convergence guarantees have been proved so far. Our first contribution in this paper is to show that UCB-V, just like UCB1, provides polynomial convergence guarantees in the MCTS setting.

Apart from the tree-policy, an important aspect of an MCTS algorithm is the employed backup method. The most common variants are Monte-Carlo (MC) backups and the more recently suggested dynamic programming (DP) backups (Keller and Helmert 2013). DP backups have become increasingly popular because they show good convergence properties in practice (see Feldman and Domshlak 2014a for a comparison). The use of variance-based policies, however, has so far been restricted to MC backups, where the variance information is readily available. In contrast, DP backups do not generally propagate variance information.
Our second contribution is the derivation of update equations for the variance that enable the use of variance-based policies in conjunction with DP backups. Finally, we evaluate UCB-V and UCB1 in different environments showing that, depending on the problem characteristics, UCB-V significantly outperforms UCB1 both with MC and with DP backups.

In the remainder we will discuss related work on MCTS and reinforcement learning, present the proof for the convergence guarantees of UCB-V, derive the update equations for the variance with DP backups, and present our empirical results.

Background & Related Work

Monte-Carlo Tree Search

There exists a wide variety of MCTS algorithms that differ in a number of aspects. Most of them follow a generic scheme that we reproduce in Alg. 1 for convenience. Note that some recent suggestions deviate slightly from this scheme (Keller and Helmert 2013; Feldman and Domshlak 2014b). In Alg. 1 we highlighted the open parameters that need to be defined in order to produce a specific MCTS implementation.

Algorithm 1 MCTS: Generic algorithm with open parameters for finite-horizon non-discounted environments. Notation: ( ) is a tuple; 〈 〉 is a list; + appends an element to the list; |l| is the length of list l, and l_i is its i-th element.
Input: v0 → root node; s0 → current state; M → environment model
Output: a* → optimal action from root node / current state

 1: function MCTS(v0, s0, M)
 2:     while time permits do
 3:         (ρ, s) ← FOLLOWTREEPOLICY(v0, s0)
 4:         R ← FOLLOWDEFAULTPOLICY(s)
 5:         UPDATE(ρ, R)
 6:     end while
 7:     return BESTACTION(v0)            → open parameter
 8: end function
 9: function FOLLOWTREEPOLICY(v, s)
10:     ρ ← 〈 〉
11:     do
12:         a ← TREEPOLICY(v)            → open parameter
13:         (s′, r) ← M(a, s)
14:         ρ ← ρ + 〈(v, s, a, s′, r)〉
15:         v ← FINDNODE(v, a, s′)       → open parameter
16:         s ← s′
17:     while v is not a leaf node
18:     return (ρ, s)
19: end function
20: function FOLLOWDEFAULTPOLICY(s)
21:     R ← 0
22:     repeat
23:         a ← DEFAULTPOLICY(s)         → open parameter
24:         (s′, r) ← M(a, s)
25:         R ← R + r
26:         s ← s′
27:     until s is terminal state
28:     return R
29: end function
30: function UPDATE(ρ, R)
31:     for i in |ρ|, …, 1 do
32:         (v, s, a, s′, r) ← ρ_i
33:         BACKUP(v, s, a, s′, r, R)    → open parameter
34:         R ← r + R
35:     end for
36: end function

Two of these parameters, the TREEPOLICY and the BACKUP method, will be discussed in more detail below. BESTACTION(v0) selects the action that is eventually recommended – usually the action with maximum empirical mean return (see e.g. Browne et al. 2012 for alternatives). FINDNODE(v, a, s′) selects a child node or creates a new leaf node if the child does not exist. This procedure usually builds a tree but it can also construct directed acyclic graphs (see e.g. Saffidine, Cazenave, and Méhat 2012). DEFAULTPOLICY(s) is a heuristic policy for initializing the return for new leaf nodes – usually the uniform policy.

TREEPOLICY(v) The tree-policy selects actions in internal nodes and has to deal with the exploration-exploitation dilemma: it has to focus on high-return branches (exploitation) but it also has to sample sub-optimal branches to some extent (exploration) to make sure the estimated returns converge to the true ones.
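To make the control flow of Alg. 1 concrete, the following minimal Python sketch mirrors the generic scheme. It is an illustrative rendering under stated assumptions, not the implementation evaluated in this paper: the callables tree_policy, default_policy, find_node, backup, best_action, and is_terminal stand in for the open parameters (and the terminal-state check), model(a, s) plays the role of M(a, s), and the node method is_leaf() is an assumed interface.

# Minimal sketch of the generic MCTS scheme in Alg. 1 (illustrative only).
# The open parameters are passed in as callables; model(a, s) simulates one
# transition and returns (next_state, reward), mirroring M(a, s).

def follow_tree_policy(v, s, model, tree_policy, find_node):
    """Descend the tree until a leaf node is reached; record the trajectory."""
    rho = []                                   # list of (v, s, a, s', r) tuples
    while True:                                # do-while, as in Alg. 1
        a = tree_policy(v)                     # open parameter
        s_next, r = model(a, s)
        rho.append((v, s, a, s_next, r))
        v = find_node(v, a, s_next)            # open parameter
        s = s_next
        if v.is_leaf():
            return rho, s

def follow_default_policy(s, model, default_policy, is_terminal):
    """Roll out with the default policy and accumulate the undiscounted return."""
    R = 0.0
    while not is_terminal(s):
        a = default_policy(s)                  # open parameter
        s, r = model(a, s)
        R += r
    return R

def update(rho, R, backup):
    """Walk the trajectory backwards, backing up the return at every node."""
    for v, s, a, s_next, r in reversed(rho):
        backup(v, s, a, s_next, r, R)          # open parameter
        R = r + R                              # extend the return by one step

def mcts(v0, s0, model, tree_policy, default_policy, find_node,
         backup, best_action, is_terminal, num_trials=1000):
    for _ in range(num_trials):                # stands in for "while time permits"
        rho, s = follow_tree_policy(v0, s0, model, tree_policy, find_node)
        R = follow_default_policy(s, model, default_policy, is_terminal)
        update(rho, R, backup)
    return best_action(v0)                     # open parameter

Plugging different tree-policies and backup procedures into this loop yields the specific MCTS variants discussed below.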
A common choice for the tree-policy is UCB1 (Auer, Cesa-Bianchi, and Fischer 2002), which chooses actions as

    a^* = \arg\max_a B_{(s,a)}    (1)

with

    B_{(s,a)} = \hat{R}_{(s,a)} + 2 C_p \sqrt{\frac{2 \log n_s}{n_{(s,a)}}}    (2)

where R̂_(s,a) is the mean return of action a in state s, n_s is the number of visits to state s, n_(s,a) is the number of times action a was taken in state s, the returns are assumed to be in [0, 1], and the constant C_p > 0 controls exploration. (We use states and actions as subscripts to remain consistent with the MCTS setting.) For UCB1, Kocsis and Szepesvári (2006) proved that the probability of choosing a sub-optimal action at the root node converges to zero at a polynomial rate as the number of trials grows to infinity. More recently, Audibert, Munos, and Szepesvári (2009) suggested UCB-V, which selects actions as

    a^* = \arg\max_a B_{(s,a)}    (3)

with

    B_{(s,a)} = \hat{R}_{(s,a)} + \sqrt{\frac{2 \tilde{R}_{(s,a)} \zeta \log n_s}{n_{(s,a)}}} + 3 c b \frac{\zeta \log n_s}{n_{(s,a)}}    (4)

where R̂_(s,a), n_s, and n_(s,a) are as above, R̃_(s,a) is the empirical variance of the return, rewards are assumed to be in [0, b], and the constants c, ζ > 0 control the algorithm's behavior. For the bandit setting Audibert, Munos, and Szepesvári (2009) proved regret bounds, but for the MCTS setting we are not aware of any proof similar to the one for UCB1. In Section Bounds and Convergence Guarantees we will adapt the proof of Kocsis and Szepesvári (2006) to show that UCB-V provides the same convergence guarantees as UCB1 in the MCTS setting.

BACKUP(v, s, a, s′, r, R) The BACKUP procedure is responsible for updating node v given the transition (s, a) → (s′, r) and the return R of the corresponding trial. It has to maintain the data needed for evaluating the tree-policy. In the simplest case of MC backups the BACKUP procedure maintains visit counts n_s, action counts n_(s,a), and an estimate of the expected return R̂_(s,a) obtained by averaging the returns R. In the more recently suggested DP backups (Keller and Helmert 2013) the BACKUP procedure also maintains a transition model and an estimate of the expected immediate reward, which are then used to calculate R̂_(s,a) while the return samples R are ignored. MC and DP backups have significantly different characteristics that are the subject of ongoing research (Feldman and Domshlak 2014a). Recently, temporal difference learning and function approximation have also been proposed as backup methods (Silver, Sutton, and Müller 2012; Guez et al. 2014). It has also been suggested to use different backup methods depending on the empirical variance of the returns (Bnaya et al. 2015).

When attempting to use variance information in MCTS, a major problem arises because the variance of the return is usually not maintained by the BACKUP procedure. As we discuss in Section Variance Backups, for MC backups the extension is straightforward whereas for DP backups this is not the case. The combination of variance-based tree-policies with DP backups has therefore not been possible so far. In this paper we close this gap by deriving general update equations for the variance with DP backups.

In conclusion, while the UCB-V policy has been established for bandits, no convergence proof for its use in MCTS exists to date. Furthermore, DP backups have to date not been extended to include variance updates, thus limiting the applicability of UCB-V and other variance-based methods in MCTS.

Reinforcement Learning

The explor
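As a supplement to Eqs. (1)–(4) above, the following minimal Python sketch shows how the two bounds could be computed from per-node statistics, together with a running-statistics helper of the kind an MC backup could maintain so that the empirical variance needed by UCB-V is available. The function and parameter names, the default constants, and the Welford-style update are illustrative assumptions, not the paper's implementation.

import math

def ucb1_bound(mean_return, n_s, n_sa, c_p=1.0 / math.sqrt(2)):
    """UCB1 bound of Eq. (2): mean return plus an exploration bonus.
    Assumes returns normalized to [0, 1]; c_p > 0 controls exploration
    (the default 1/sqrt(2) is a common choice, not prescribed here)."""
    return mean_return + 2.0 * c_p * math.sqrt(2.0 * math.log(n_s) / n_sa)

def ucbv_bound(mean_return, var_return, n_s, n_sa, b=1.0, c=1.0, zeta=1.2):
    """UCB-V bound of Eq. (4): the bonus shrinks with the empirical variance.
    Assumes rewards in [0, b]; c and zeta are free parameters and the
    defaults are illustrative only."""
    e = zeta * math.log(n_s)
    return (mean_return
            + math.sqrt(2.0 * var_return * e / n_sa)
            + 3.0 * c * b * e / n_sa)

class RunningReturnStats:
    """Per-(s, a) statistics an MC backup could maintain for UCB-V:
    visit count, mean return, and empirical variance (Welford update)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0          # sum of squared deviations from the mean

    def add(self, R):
        self.n += 1
        delta = R - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (R - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n > 0 else 0.0

A tree-policy would then select the argmax of the respective bound over the actions available at the node, e.g. max(actions, key=lambda a: ucbv_bound(stats[a].mean, stats[a].variance, n_s, stats[a].n)), where stats is the hypothetical per-action statistics table.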

References

[1] Andrew Y. Ng, et al. Near-Bayesian exploration in polynomial time, 2009, ICML '09.
[2] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.
[3] Shie Mannor, et al. Bayesian Reinforcement Learning, 2012, Reinforcement Learning.
[4] Stuart J. Russell, et al. Bayesian Q-Learning, 1998, AAAI/IAAI.
[5] Csaba Szepesvári, et al. Bandit Based Monte-Carlo Planning, 2006, ECML.
[6] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.
[7] Simon M. Lucas, et al. A Survey of Monte Carlo Tree Search Methods, 2012, IEEE Transactions on Computational Intelligence and AI in Games.
[8] Csaba Szepesvári, et al. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, 2009, Theor. Comput. Sci.
[9] Carmel Domshlak, et al. Monte-Carlo Tree Search: To MC or to DP?, 2014, ECAI.
[10] Richard S. Sutton, et al. Temporal-difference search in computer Go, 2012, Machine Learning.
[11] Abdallah Saffidine, et al. UCD: Upper Confidence Bound for Rooted Directed Acyclic Graphs, 2010.
[12] Rami Puzis, et al. Confidence Backup Updates for Aggregating MDP State Values in Monte-Carlo Tree Search, 2015, SOCS.
[13] Malte Helmert, et al. Trial-Based Heuristic Tree Search for Finite Horizon MDPs, 2013, ICAPS.
[14] Peter Dayan, et al. Bayes-Adaptive Simulation-based Search with Value Function Approximation, 2014, NIPS.
[15] Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.
[16] Anne S. Hawkins. Bandit Problems—Sequential Allocation of Experiments, 1987.
[17] Carmel Domshlak, et al. On MABs and Separation of Concerns in Monte-Carlo Planning for MDPs, 2014, ICAPS.