Variance estimates and exploration functions in multi-armed bandits

Algorithms based on upper confidence bounds for balancing exploration and exploitation are gaining popularity, since they are easy to implement, efficient, and effective. This paper considers a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. This paper provides the first analysis of the expected regret of such algorithms and of the concentration of the regret of upper confidence bound algorithms. As expected, the regret analyses suggest that an algorithm that uses variance estimates can have a major advantage over alternatives that do not use such estimates when the variances of the payoffs of the suboptimal arms are low. This work, however, reveals that the regret concentrates only at a polynomial rate. This holds for all upper confidence bound based algorithms and for all bandit problems except those rare ones in which, with probability one, the payoff of the optimal arm is always larger than the expected payoff of the second-best arm. Hence, although upper confidence bound bandit algorithms achieve logarithmic expected regret rates, a risk-averse decision maker may prefer an alternative algorithm. The paper also illustrates some of the results with computer simulations.
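The abstract does not spell out the variance-aware index itself. The sketch below is a minimal illustration of the kind of algorithm described, assuming a Bernstein-style index that adds an empirical-variance term and a range-dependent correction to the empirical mean; the parameter names zeta, c, and b, the sampler interface, and the specific constants are assumptions for illustration, not the paper's definition.

```python
import math
import random

def variance_ucb_index(mean, var, pulls, t, zeta=1.2, c=1.0, b=1.0):
    """Upper-confidence index using the empirical variance.

    mean, var : empirical mean and variance of the arm's payoffs so far
    pulls     : number of times this arm has been pulled
    t         : current round; the exploration term grows with log t
    zeta, c   : exploration parameters (assumed names)
    b         : assumed bound on the payoff range
    """
    e = zeta * math.log(t)
    return mean + math.sqrt(2.0 * var * e / pulls) + c * 3.0 * b * e / pulls

def run_variance_ucb(arms, horizon, zeta=1.2, c=1.0, b=1.0):
    """Pull each arm once, then always pull the arm with the highest index."""
    k = len(arms)
    sums = [0.0] * k     # running sum of payoffs per arm
    sq_sums = [0.0] * k  # running sum of squared payoffs per arm
    pulls = [0] * k

    def pull(i):
        x = arms[i]()    # each arm is a zero-argument payoff sampler
        sums[i] += x
        sq_sums[i] += x * x
        pulls[i] += 1

    for i in range(k):   # initialisation: one pull per arm
        pull(i)
    for t in range(k + 1, horizon + 1):
        def index(i):
            mean = sums[i] / pulls[i]
            var = max(sq_sums[i] / pulls[i] - mean * mean, 0.0)
            return variance_ucb_index(mean, var, pulls[i], t, zeta, c, b)
        pull(max(range(k), key=index))
    return pulls

# Usage example: two Bernoulli arms with means 0.5 and 0.6.
random.seed(0)
arms = [lambda: float(random.random() < 0.5),
        lambda: float(random.random() < 0.6)]
print(run_variance_ucb(arms, 10_000))  # pull counts should favour arm 1
```

The point of the variance term is visible in the index: when a suboptimal arm has low empirical variance, its confidence width shrinks quickly with the number of pulls, so the algorithm can dismiss it after fewer samples than a mean-only bound would require.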