Optimal Simple Regret in Bayesian Best Arm Identification

We consider Bayesian best arm identification in the multi-armed bandit problem. Assuming certain continuity conditions on the prior, we characterize the rate of the Bayesian simple regret. In contrast to Bayesian regret minimization (Lai, 1987), the leading term of the Bayesian simple regret derives from the region where the gap between the optimal and suboptimal arms is smaller than √(log T / T). We propose a simple and easy-to-compute algorithm whose leading term matches the lower bound up to a constant factor; simulation results support our theoretical findings.
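To make the quantity under study concrete, the following is a minimal Monte Carlo sketch of Bayesian simple regret for two Bernoulli arms: arm means are drawn from a uniform prior, a sampling rule explores for T rounds, a single arm is recommended, and the regret of that recommendation is averaged over prior draws. Thompson sampling and the posterior-mean recommendation rule here are illustrative stand-ins, not the algorithm proposed in the paper.

```python
import random

def bayesian_simple_regret(T=500, n_trials=200, seed=0):
    """Estimate Bayesian simple regret: E[ max_i mu_i - mu_rec ],
    with the expectation over a uniform (Beta(1,1)) prior on two
    Bernoulli arm means and over the algorithm's randomness."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        mus = [rng.random(), rng.random()]   # arm means drawn from the prior
        s = [0, 0]                           # posterior success counts
        f = [0, 0]                           # posterior failure counts
        for _ in range(T):
            # Thompson sampling: sample each Beta posterior, pull the argmax
            draws = [rng.betavariate(s[i] + 1, f[i] + 1) for i in range(2)]
            a = draws.index(max(draws))
            if rng.random() < mus[a]:
                s[a] += 1
            else:
                f[a] += 1
        # recommend the arm with the highest posterior mean
        post_means = [(s[i] + 1) / (s[i] + f[i] + 2) for i in range(2)]
        rec = post_means.index(max(post_means))
        total += max(mus) - mus[rec]         # simple regret of the recommendation
    return total / n_trials
```

Averaging over the prior is what distinguishes this Bayesian notion from the worst-case (frequentist) simple regret: instances with a tiny gap between the arms are likely under a continuous prior, which is why the √(log T / T) gap region dominates the rate.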

[1]  T. L. Lai and Herbert Robbins  Asymptotically Efficient Adaptive Allocation Rules , 1985 .

[2]  Rémi Munos,et al.  Pure exploration in finitely-armed and continuous-armed bandits , 2011, Theor. Comput. Sci..

[3]  Yu-Chi Ho,et al.  Ordinal optimization of DEDS , 1992, Discret. Event Dyn. Syst..

[4]  Akimichi Takemura,et al.  Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards , 2015, J. Mach. Learn. Res..

[5]  Emilie Kaufmann,et al.  Analysis of Bayesian and frequentist strategies for sequential resource allocation , 2014 .

[6]  Peter I. Frazier,et al.  A Tutorial on Bayesian Optimization , 2018, ArXiv.

[7]  Michal Valko,et al.  Fixed-Confidence Guarantees for Bayesian Best-Arm Identification , 2019, AISTATS.

[8]  R. Weber On the Gittins Index for Multiarmed Bandits , 1992 .

[9]  Diego Klabjan,et al.  Improving the Expected Improvement Algorithm , 2017, NIPS.

[10]  Daniel Russo,et al.  A Note on the Equivalence of Upper Confidence Bounds and Gittins Indices for Patient Agents , 2019, Oper. Res..

[11]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[12]  Walter T. Federer,et al.  Sequential Design of Experiments , 1967 .

[13]  T. Lai Adaptive treatment allocation and the multi-armed bandit problem , 1987 .

[14]  Alexandra Carpentier,et al.  Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem , 2016, COLT.

[15]  Rémi Munos,et al.  Thompson Sampling: An Optimal Finite Time Analysis , 2012, ArXiv.

[16]  Chun-Hung Chen,et al.  Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization , 2000, Discret. Event Dyn. Syst..

[17]  Sattar Vakili,et al.  Optimal Order Simple Regret for Gaussian Process Bandits , 2021, NeurIPS.

[18]  Shipra Agrawal,et al.  Analysis of Thompson Sampling for the Multi-armed Bandit Problem , 2011, COLT.

[19]  Shie Mannor,et al.  Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems , 2006, J. Mach. Learn. Res..

[20]  Adam D. Bull,et al.  Convergence Rates of Efficient Global Optimization Algorithms , 2011, J. Mach. Learn. Res..

[21]  Ole-Christoffer Granmo,et al.  A Bayesian Learning Automaton for Solving Two-Armed Bernoulli Bandit Problems , 2008, 2008 Seventh International Conference on Machine Learning and Applications.

[22]  Dominik D. Freydenberger,et al.  Can We Learn to Gamble Efficiently? , 2010, COLT.

[23]  Ilya O. Ryzhov,et al.  On the Convergence Rates of Expected Improvement Methods , 2016, Oper. Res..

[24]  W. R. Thompson  On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples , 1933 .

[25]  Aurélien Garivier,et al.  On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models , 2014, J. Mach. Learn. Res..

[26]  Andreas Krause,et al.  Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting , 2009, IEEE Transactions on Information Theory.

[27]  H. Robbins Some aspects of the sequential design of experiments , 1952 .

[28]  Daniel Russo,et al.  Simple Bayesian Algorithms for Best Arm Identification , 2016, COLT.

[29]  Andrew W. Moore,et al.  The Racing Algorithm: Model Selection for Lazy Learners , 1997, Artificial Intelligence Review.

[30]  Peter W. Glynn,et al.  A large deviations perspective on ordinal optimization , 2004, Proceedings of the 2004 Winter Simulation Conference, 2004..

[31]  Christian M. Ernst,et al.  Multi-armed Bandit Allocation Indices , 1989 .

[32]  E. Paulson A Sequential Procedure for Selecting the Population with the Largest Mean from $k$ Normal Populations , 1964 .