Monte Carlo Tree Search guided by Symbolic Advice for MDPs

In this paper, we consider the online computation of a strategy that aims to optimize the expected average reward in a Markov decision process. The strategy is computed with a receding horizon and using Monte Carlo tree search (MCTS). We augment the MCTS algorithm with the notion of symbolic advice, and show that its classical theoretical guarantees are maintained. Symbolic advice is used to bias the selection and simulation strategies of MCTS. We describe how to use QBF and SAT solvers to implement symbolic advice in an efficient way. We illustrate our new algorithm using the popular game Pac-Man and show that the performance of our algorithm exceeds that of plain MCTS as well as that of human players.
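To give a feel for how advice can bias the selection step, the following is a minimal sketch of UCB1-style child selection where the set of candidate actions is first filtered by an advice predicate. This is an illustrative assumption, not the paper's implementation: the function `uct_select`, the child dictionary layout, and the `advice` lambda are all hypothetical names chosen for the example, and a real symbolic advice would be evaluated by a QBF or SAT solver rather than a Python predicate.

```python
import math

def uct_select(node_children, advice, c=1.4):
    """Pick the child maximizing the UCB1 score, restricted to
    actions the (hypothetical) advice predicate allows.
    Falls back to all children if the advice rules everything out."""
    allowed = [ch for ch in node_children if advice(ch["action"])] or node_children
    total = sum(ch["visits"] for ch in allowed) or 1

    def ucb(ch):
        if ch["visits"] == 0:
            return float("inf")  # always explore unvisited children first
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore

    return max(allowed, key=ucb)

children = [
    {"action": "left",  "visits": 10, "value": 4.0},
    {"action": "right", "visits": 5,  "value": 3.0},
    {"action": "stay",  "visits": 8,  "value": 2.0},
]
# Advice forbidding "stay": selection only considers the allowed actions.
best = uct_select(children, advice=lambda a: a != "stay")
print(best["action"])  # prints "right"
```

The same filtering idea applies to the simulation phase, where rollouts would sample only among advice-compliant actions.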
