Optimal $\delta$-Correct Best-Arm Selection for General Distributions

Given a finite set of unknown distributions, or arms, that can be sampled, we consider the problem of identifying the one with the largest mean using a $\delta$-correct algorithm (an adaptive, sequential algorithm that restricts the probability of error to a specified $\delta$) with minimum sample complexity. Lower bounds for $\delta$-correct algorithms are well known. $\delta$-correct algorithms that match the lower bound asymptotically as $\delta$ decreases to zero have previously been developed when arm distributions are restricted to a single-parameter exponential family. In this paper, we first observe a negative result showing that some restriction is essential: otherwise, under any $\delta$-correct algorithm, distributions with unbounded support would require an infinite expected number of samples. We then propose a $\delta$-correct algorithm that matches the lower bound as $\delta$ decreases to zero under the mild restriction that a known bound exists on the expectation of a non-negative, continuous, increasing convex function (for example, the second moment) of the underlying random variables. We also propose batch processing and identify near-optimal batch sizes that substantially speed up the proposed algorithm. The best-arm problem has many learning applications, including recommendation systems and product selection. It is also a well-studied classical problem in the simulation community.
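
As a brief, informal sketch of the fixed-confidence formulation referred to above (the notation here is ours and not taken verbatim from the paper): a $\delta$-correct algorithm with stopping time $\tau_\delta$ and recommended arm $\hat{a}$ must satisfy, for every instance $\nu$ in the allowed class with a unique best arm $a^*(\nu)$,
\[
\mathbb{P}_{\nu}\big(\hat{a} \neq a^*(\nu)\big) \le \delta ,
\]
and the well-known lower bound takes the form
\[
\liminf_{\delta \to 0} \frac{\mathbb{E}_{\nu}[\tau_\delta]}{\log(1/\delta)} \;\ge\; T^*(\nu),
\qquad
\frac{1}{T^*(\nu)} \;=\; \sup_{w \in \Sigma_K} \; \inf_{\nu' \in \mathrm{Alt}(\nu)} \; \sum_{a=1}^{K} w_a \, \mathrm{KL}(\nu_a, \nu'_a),
\]
where $\Sigma_K$ is the simplex of sampling proportions over the $K$ arms, $\mathrm{Alt}(\nu)$ is the set of alternative instances (within the allowed class) whose best arm differs from that of $\nu$, and $\mathrm{KL}$ denotes the Kullback-Leibler divergence. Matching the lower bound as $\delta$ decreases to zero then means the algorithm's expected sample complexity equals $T^*(\nu)\log(1/\delta)\,(1+o(1))$.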
