UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem

Abstract

In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in $K$-armed bandits after $T$ trials is bounded by $\mathrm{const} \cdot \frac{K \log T}{\Delta}$, where $\Delta$ measures the gap between the expected reward of a suboptimal arm and that of the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of $\mathrm{const} \cdot \frac{K \log(T \Delta^2)}{\Delta}$.
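For context, the following is a minimal sketch of the original UCB1 index policy of Auer et al. [4] that the abstract takes as its baseline, not the modified algorithm analyzed in this paper. The `pull_arm` callback and the Bernoulli test environment are illustrative assumptions introduced here for the example.

```python
import math
import random

def ucb1(pull_arm, K, T):
    """Play each arm once, then always pull the arm maximizing
    empirical mean + sqrt(2 ln t / n_i) (the UCB1 index)."""
    counts = [0] * K   # n_i: number of times arm i has been pulled
    means = [0.0] * K  # empirical mean reward of arm i
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1  # initialization: pull each arm once
        else:
            i = max(range(K), key=lambda a: means[a]
                    + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = pull_arm(i)  # observe a reward in [0, 1]
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
    return counts

# Illustration: three Bernoulli arms; UCB1 concentrates pulls on arm 2.
probs = [0.3, 0.5, 0.6]
print(ucb1(lambda i: float(random.random() < probs[i]), K=3, T=10000))
```

In this setting the regret after $T$ pulls is the sum over suboptimal arms $i$ of $\Delta_i \cdot n_i$, which is how the $\frac{K \log T}{\Delta}$ bound above is measured.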