We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is defined here as the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal, since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy requires the tuning of a parameter that depends on the unobservable hardness of the task, the successive rejects policy is parameter-free and also independent of the scaling of the rewards. As a by-product of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a $\log(K)$ factor) $\sum_i 1/\Delta_i^2$, where the sum is over the suboptimal arms and $\Delta_i$ denotes the difference between the mean reward of the best arm and that of arm $i$. This generalizes the well-known fact that one needs of order $1/\Delta^2$ samples to differentiate the means of two distributions with gap $\Delta$.
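To make the successive rejects idea concrete, here is a minimal Python sketch of one commonly cited formulation: the budget of $n$ pulls is split into $K-1$ phases, every surviving arm is sampled equally within a phase, and the arm with the lowest empirical mean is dismissed at the end of each phase. The phase schedule (cumulative counts $n_k$ and the constant $\overline{\log}(K)$) goes beyond what the abstract states and is an assumption here, as are the arms-as-callables interface, the function names, and the Bernoulli example.

```python
import math
import random

def successive_rejects(arms, budget):
    """Sketch of a successive rejects strategy over a fixed pull budget.

    `arms` is a list of K zero-argument callables, each returning one
    stochastic reward; `budget` is the total number of pulls n > K.
    The callable interface and all names are illustrative assumptions.
    """
    K = len(arms)
    assert budget > K, "need at least one pull per arm in the first phase"

    # Assumed constant: log-bar(K) = 1/2 + sum_{i=2}^{K} 1/i.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))

    # Assumed cumulative per-arm pull counts: n_0 = 0 and, for k >= 1,
    # n_k = ceil((n - K) / (log_bar(K) * (K + 1 - k))).
    def n_k(k):
        if k == 0:
            return 0
        return math.ceil((budget - K) / (log_bar * (K + 1 - k)))

    active = list(range(K))   # indices of surviving arms
    sums = [0.0] * K          # cumulative reward per arm
    counts = [0] * K          # pulls per arm

    for k in range(1, K):     # K - 1 elimination phases
        for i in active:
            for _ in range(n_k(k) - n_k(k - 1)):  # extra pulls this phase
                sums[i] += arms[i]()
                counts[i] += 1
        # Reject the arm with the lowest empirical mean so far.
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)

    return active[0]          # the single remaining, recommended arm

# Example: three Bernoulli arms with means 0.3, 0.5, and 0.7.
bernoulli = lambda p: (lambda: 1.0 if random.random() < p else 0.0)
arms = [bernoulli(p) for p in (0.3, 0.5, 0.7)]
print(successive_rejects(arms, budget=600))  # usually prints 2
```

With this schedule and $n = 600$, $K = 3$, each arm receives 150 pulls in phase 1 and each survivor 74 more in phase 2, for 598 pulls in total; the cumulative counts are chosen so the phases never exceed the budget. Note that the sketch needs no parameter beyond the budget itself, which is the parameter-free property the abstract highlights.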