We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is defined here as the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal, since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy requires the tuning of a parameter that depends on the unobservable hardness of the task, the successive rejects policy is parameter-free and also independent of the scaling of the rewards. As a by-product of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a $\log(K)$ factor) $\sum_i 1/\Delta_i^2$, where the sum is over the suboptimal arms and $\Delta_i$ denotes the difference between the mean reward of the best arm and that of arm $i$. This generalizes the well-known fact that one needs of order $1/\Delta^2$ samples to differentiate the means of two distributions with gap $\Delta$.
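To make the successive rejects idea concrete, here is a minimal Python sketch of one commonly cited formulation: the budget of $n$ pulls is split into $K-1$ phases, every surviving arm is sampled equally within a phase, and the arm with the lowest empirical mean is dismissed at the end of each phase. The phase schedule (cumulative counts $n_k$ and the constant $\overline{\log}(K)$) goes beyond what the abstract states and is an assumption here, as are the arms-as-callables interface, the function names, and the Bernoulli example.

```python
import math
import random

def successive_rejects(arms, budget):
    """Sketch of a successive rejects strategy over a fixed pull budget.

    `arms` is a list of K zero-argument callables, each returning one
    stochastic reward; `budget` is the total number of pulls n > K.
    The callable interface and all names are illustrative assumptions.
    """
    K = len(arms)
    assert budget > K, "need at least one pull per arm in the first phase"

    # Assumed constant: log-bar(K) = 1/2 + sum_{i=2}^{K} 1/i.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))

    # Assumed cumulative per-arm pull counts: n_0 = 0 and, for k >= 1,
    # n_k = ceil((n - K) / (log_bar(K) * (K + 1 - k))).
    def n_k(k):
        if k == 0:
            return 0
        return math.ceil((budget - K) / (log_bar * (K + 1 - k)))

    active = list(range(K))   # indices of surviving arms
    sums = [0.0] * K          # cumulative reward per arm
    counts = [0] * K          # pulls per arm

    for k in range(1, K):     # K - 1 elimination phases
        for i in active:
            for _ in range(n_k(k) - n_k(k - 1)):  # extra pulls this phase
                sums[i] += arms[i]()
                counts[i] += 1
        # Reject the arm with the lowest empirical mean so far.
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)

    return active[0]          # the single remaining, recommended arm

# Example: three Bernoulli arms with means 0.3, 0.5, and 0.7.
bernoulli = lambda p: (lambda: 1.0 if random.random() < p else 0.0)
arms = [bernoulli(p) for p in (0.3, 0.5, 0.7)]
print(successive_rejects(arms, budget=600))  # usually prints 2
```

With this schedule and $n = 600$, $K = 3$, each arm receives 150 pulls in phase 1 and each survivor 74 more in phase 2, for 598 pulls in total; the cumulative counts are chosen so the phases never exceed the budget. Note that the sketch needs no parameter beyond the budget itself, which is the parameter-free property the abstract highlights.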