From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning

This work covers several aspects of the optimism-in-the-face-of-uncertainty principle applied to large-scale optimization problems under a finite numerical budget. The initial motivation for the research reported here came from the empirical success of the so-called Monte-Carlo Tree Search method, popularized in computer Go and later extended to many other games as well as to optimization and planning problems. Our objective is to contribute to the theoretical foundations of the field by characterizing the complexity of the underlying optimization problems and by designing efficient algorithms with performance guarantees.

The main idea presented here is that a complex decision-making problem (such as an optimization problem over a large search space) can be decomposed into a sequence of elementary decisions, where each decision in the sequence is solved using a (stochastic) multi-armed bandit, a simple mathematical model of decision making in a stochastic environment. This so-called hierarchical bandit approach, in which the reward observed by a bandit in the hierarchy is itself the return of another bandit at a deeper level, has the appealing property of starting with a quasi-uniform exploration of the space and then progressively focusing, at different scales, on the most promising areas according to the evaluations observed so far, eventually performing a local search around the global optima of the function. The performance of the method is assessed in terms of the optimality of the returned solution as a function of the number of function evaluations.

Our main contribution to the field of function optimization is a class of hierarchical optimistic algorithms designed for general search spaces (such as metric spaces, trees, graphs, or Euclidean spaces), with different algorithmic instantiations depending on whether the evaluations are noisy or noiseless and on whether some measure of the "smoothness" of the function is known or unknown. The performance of the algorithms depends on the local behavior of the function around its global optima, expressed in terms of the quantity of near-optimal states measured with some metric. If this local smoothness is known, one can design very efficient optimization algorithms (with a convergence rate independent of the space dimension); when it is not known, one can build adaptive techniques that, in some cases, perform almost as well as when the smoothness is known.
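
To make the optimistic principle concrete, the following is a minimal, self-contained Python sketch of optimistic optimization of a deterministic function on an interval, assuming the smoothness is known through a Lipschitz constant. The function name optimistic_optimize, its parameters, and the test function are illustrative assumptions rather than the exact algorithms studied in this work: each cell of a hierarchical partition is scored by an optimistic upper bound on the function inside it, and the cell with the best bound is refined first, so sampling starts almost uniformly and then concentrates around the maximizer.

    import math

    def optimistic_optimize(f, lo, hi, lipschitz, budget):
        """Illustrative sketch: optimistically maximize f on [lo, hi]
        given a Lipschitz constant and a budget of function evaluations.

        Each cell is scored by an upper bound on f inside it (value at
        the midpoint plus lipschitz * half-width); the cell with the
        best bound is split first.
        """
        # Each cell is (upper_bound, midpoint_value, lo, hi).
        mid = 0.5 * (lo + hi)
        val = f(mid)
        cells = [(val + lipschitz * 0.5 * (hi - lo), val, lo, hi)]
        best_x, best_val = mid, val
        evaluations = 1

        while evaluations < budget:
            # Select the cell with the highest optimistic upper bound.
            i = max(range(len(cells)), key=lambda j: cells[j][0])
            _, _, a, b = cells.pop(i)
            # Split it into two children and evaluate their midpoints.
            for a2, b2 in ((a, 0.5 * (a + b)), (0.5 * (a + b), b)):
                m = 0.5 * (a2 + b2)
                v = f(m)
                evaluations += 1
                cells.append((v + lipschitz * 0.5 * (b2 - a2), v, a2, b2))
                if v > best_val:
                    best_x, best_val = m, v
                if evaluations >= budget:
                    break
        return best_x, best_val

    if __name__ == "__main__":
        # Toy usage: maximize a smooth multimodal function on [0, 1].
        f = lambda x: math.sin(13 * x) * math.sin(27 * x) / 2 + 0.5
        x_star, f_star = optimistic_optimize(f, 0.0, 1.0, lipschitz=40.0, budget=200)
        print("best x = %.4f, f(x) = %.4f" % (x_star, f_star))

Replacing the exact evaluation by noisy samples, or the known Lipschitz bound by an adaptive estimate, leads to the noisy and smoothness-unknown instantiations discussed above; this deterministic version only illustrates the optimistic selection rule itself.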
