Optimal $\delta$-Correct Best-Arm Selection for General Distributions

Given a finite set of unknown distributions, or arms, that can be sampled, we consider the problem of identifying the one with the largest mean using a $\delta$-correct algorithm (an adaptive, sequential algorithm that restricts the probability of error to a specified $\delta$) with minimum sample complexity. Lower bounds for $\delta$-correct algorithms are well known. $\delta$-correct algorithms that match the lower bound asymptotically as $\delta$ decreases to zero have previously been developed when arm distributions are restricted to a single-parameter exponential family. In this paper, we first observe a negative result showing that some restriction is essential: otherwise, under any $\delta$-correct algorithm, distributions with unbounded support would require an infinite expected number of samples. We then propose a $\delta$-correct algorithm that matches the lower bound as $\delta$ decreases to zero under the mild restriction that a known bound exists on the expectation of a non-negative, continuous, increasing convex function (for example, the second moment) of the underlying random variables. We also propose batch processing and identify near-optimal batch sizes that substantially speed up the proposed algorithm. The best-arm problem has many learning applications, including recommendation systems and product selection. It is also a well-studied classical problem in the simulation community.
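
As a brief, informal sketch of the fixed-confidence formulation referred to above (the notation here is ours and not taken verbatim from the paper): a $\delta$-correct algorithm with stopping time $\tau_\delta$ and recommended arm $\hat{a}$ must satisfy, for every instance $\nu$ in the allowed class with a unique best arm $a^*(\nu)$,
\[
\mathbb{P}_{\nu}\big(\hat{a} \neq a^*(\nu)\big) \le \delta ,
\]
and the well-known lower bound takes the form
\[
\liminf_{\delta \to 0} \frac{\mathbb{E}_{\nu}[\tau_\delta]}{\log(1/\delta)} \;\ge\; T^*(\nu),
\qquad
\frac{1}{T^*(\nu)} \;=\; \sup_{w \in \Sigma_K} \; \inf_{\nu' \in \mathrm{Alt}(\nu)} \; \sum_{a=1}^{K} w_a \, \mathrm{KL}(\nu_a, \nu'_a),
\]
where $\Sigma_K$ is the simplex of sampling proportions over the $K$ arms, $\mathrm{Alt}(\nu)$ is the set of alternative instances (within the allowed class) whose best arm differs from that of $\nu$, and $\mathrm{KL}$ denotes the Kullback-Leibler divergence. Matching the lower bound as $\delta$ decreases to zero then means the algorithm's expected sample complexity equals $T^*(\nu)\log(1/\delta)\,(1+o(1))$.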
