Simple Bayesian Algorithms for Best Arm Identification

This paper considers the optimal adaptive allocation of measurement effort for identifying the best among a finite set of options or designs. An experimenter sequentially chooses designs to measure and observes noisy signals of their quality, with the goal of confidently identifying the best design after a small number of measurements. This paper proposes three simple and intuitive Bayesian algorithms for adaptively allocating measurement effort, and formalizes a sense in which these seemingly naive rules are the best possible. One proposal is top-two probability sampling, which computes the two designs with the highest posterior probability of being optimal and then randomizes to select among them. Another is a variant of top-two sampling that considers not only the probability that a design is optimal, but also the expected amount by which its quality exceeds that of the other designs. The final algorithm is a modified version of Thompson sampling that is tailored to identifying the best design. We prove that these simple algorithms satisfy a sharp optimality property. In a frequentist setting where the true quality of the designs is fixed, one hopes the posterior definitively identifies the optimal design, in the sense that the posterior probability assigned to the event that some other design is optimal converges to zero as measurements are collected. We show that under the proposed algorithms this convergence occurs at an exponential rate, and the corresponding exponent is the best possible among all allocation rules.
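To make the top-two idea concrete, the following is a minimal sketch of one allocation step of the modified (top-two) Thompson sampling rule described above, under assumptions not fixed by the abstract: Bernoulli reward signals with independent Beta posteriors per design, and a tunable parameter `beta` giving the probability of measuring the first candidate rather than the challenger.

```python
import random

def top_two_thompson_sample(successes, failures, beta=0.5, rng=random):
    """One allocation step of top-two Thompson sampling (illustrative sketch).

    successes, failures: per-design counts, giving Beta(1+s, 1+f) posteriors
    under a uniform prior (an assumption for this example).
    beta: probability of measuring the Thompson candidate itself.
    Returns the index of the design to measure next.
    """
    k = len(successes)
    # Draw one posterior sample per design; its argmax is the first candidate.
    sample = [rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(k)]
    first = max(range(k), key=lambda i: sample[i])
    if rng.random() < beta:
        return first
    # Otherwise resample until a *different* design comes out on top ("challenger").
    # This forces measurement effort onto close competitors of the leader.
    while True:
        sample = [rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(k)]
        challenger = max(range(k), key=lambda i: sample[i])
        if challenger != first:
            return challenger
```

The resampling loop is what distinguishes this rule from ordinary Thompson sampling: standard Thompson sampling concentrates almost all effort on the apparent best design, which is good for reward but slow for confidently ruling out the runners-up.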
