Rate-Optimal Bayesian Simple Regret in Best Arm Identification

We consider best arm identification in the multiarmed bandit problem. Assuming certain continuity conditions on the prior, we characterize the rate of the Bayesian simple regret. Differing from Bayesian regret minimization, the leading term of the Bayesian simple regret derives from the region in which the gap between the optimal and suboptimal arms is smaller than $\sqrt{\log(T)/T}$, where $T$ is the sampling budget. We propose a simple and easy-to-compute algorithm whose leading term matches the lower bound up to a constant factor; simulation results support our theoretical findings.
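The abstract does not spell out the proposed algorithm, so the following is only a minimal Monte Carlo sketch of the quantity under study: the Bayesian simple regret of a naive baseline (uniform allocation, then recommend the arm with the highest posterior mean) on a two-armed Bernoulli bandit with independent Uniform(0, 1) priors. The function name, the prior, and the allocation rule are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch, NOT the paper's algorithm: Monte Carlo estimate of
# Bayesian simple regret E[max_i mu_i - mu_rec] for a two-armed Bernoulli
# bandit with independent Uniform(0, 1) priors, a uniform sampling
# allocation, and a highest-posterior-mean recommendation rule.
import numpy as np

rng = np.random.default_rng(0)

def bayesian_simple_regret(T, n_draws=20000):
    """Estimate E[max_i mu_i - mu_recommended] over prior draws (hypothetical helper)."""
    regrets = np.empty(n_draws)
    for s in range(n_draws):
        mu = rng.uniform(0.0, 1.0, size=2)   # draw the two means from the prior
        n = T // 2                           # uniform allocation: T/2 pulls per arm
        wins = rng.binomial(n, mu)           # observed successes for each arm
        post_mean = (wins + 1) / (n + 2)     # posterior mean under a Beta(1, 1) prior
        rec = int(np.argmax(post_mean))      # recommend the highest posterior mean
        regrets[s] = mu.max() - mu[rec]      # simple regret for this instance
    return regrets.mean()

for T in (100, 400, 1600, 6400):
    print(T, bayesian_simple_regret(T))
```

As T grows the printed estimates should shrink, with most of the residual regret coming from prior draws whose gap falls below roughly $\sqrt{\log(T)/T}$, in line with the abstract's characterization of where the leading term arises.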
