Algorithms based on upper confidence bounds for balancing exploration and exploitation are gaining popularity, since they are easy to implement, efficient, and effective. This paper considers a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental work, such algorithms were found to outperform competing algorithms. The paper provides a first analysis of the expected regret of such algorithms and of the concentration of the regret of upper confidence bound algorithms. As expected, these regret analyses suggest that an algorithm that uses variance estimates can have a major advantage over alternatives that do not use them when the variances of the payoffs of the suboptimal arms are low. The analysis also reveals, however, that the regret concentrates only at a polynomial rate. This holds for all upper confidence bound based algorithms and for all bandit problems except those rare ones where, with probability one, the payoffs of the optimal arm are always larger than the expected payoff of the second-best arm. Hence, although upper confidence bound bandit algorithms achieve logarithmic expected regret rates, a risk-averse decision maker may prefer some alternative algorithm. The paper also illustrates some of the results with computer simulations.
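As a concrete illustration of the kind of index policy discussed above, the following is a minimal sketch of a variance-aware upper confidence bound rule in the spirit of such algorithms, assuming payoffs bounded in [0, b]. The exploration constant zeta, the factor 3, and the round-robin initialization are illustrative choices, not the paper's exact parameterization.

```python
import math
import random


def ucb_v_index(mean, var, n, t, b=1.0, zeta=1.2):
    """Variance-aware upper confidence bound index (sketch).

    mean, var : empirical mean and variance of the arm's payoffs
    n         : number of times the arm has been pulled so far
    t         : current round
    b         : assumed upper bound on the payoff range [0, b]
    zeta      : exploration constant (illustrative value)
    """
    e = zeta * math.log(t)
    return mean + math.sqrt(2.0 * var * e / n) + 3.0 * b * e / n


def run_bandit(arms, horizon):
    """Pull each arm once, then always pull the arm with the largest index."""
    stats = [{"n": 0, "sum": 0.0, "sumsq": 0.0} for _ in arms]
    for t in range(1, horizon + 1):
        if t <= len(arms):
            i = t - 1  # initial round-robin over the arms
        else:
            def index(s):
                m = s["sum"] / s["n"]
                v = max(s["sumsq"] / s["n"] - m * m, 0.0)  # empirical variance
                return ucb_v_index(m, v, s["n"], t)
            i = max(range(len(arms)), key=lambda k: index(stats[k]))
        x = arms[i]()  # sample a payoff from the chosen arm
        s = stats[i]
        s["n"] += 1
        s["sum"] += x
        s["sumsq"] += x * x
    return stats


if __name__ == "__main__":
    # Two Bernoulli arms with means 0.5 and 0.6 (illustrative only).
    arms = [lambda: float(random.random() < 0.5),
            lambda: float(random.random() < 0.6)]
    print(run_bandit(arms, 10_000))
```

Compared with an index built only from the payoff range, the variance term shrinks the exploration bonus of low-variance suboptimal arms, which is the source of the advantage described in the abstract.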