Sample mean based index policies with O(log n) regret for the multi-armed bandit problem

We consider a non-Bayesian, infinite-horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on this problem, Lai and Robbins obtained an O(log n) lower bound on the regret with a constant that depends on the Kullback–Leibler number. They also constructed policies for some specific families of probability distributions (including exponential families) that achieve the lower bound. In this paper we construct index policies that depend on the rewards from each arm only through their sample mean. These policies are computationally much simpler and are applicable much more generally. They achieve an O(log n) regret with a constant that is also based on the Kullback–Leibler number. This constant turns out to be optimal for one-parameter exponential families; in general, however, it is derived from the optimal one via a ‘contraction’ principle. Our results rely entirely on a few key lemmas from the theory of large deviations.
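To illustrate the general shape of a sample-mean-based index policy, the sketch below plays Bernoulli arms and, at each step, pulls the arm maximizing its sample mean plus an exploration bonus. The particular bonus term sqrt(2 log t / n_j) is an assumed, commonly used variant and not the paper's exact construction; all function names and the Bernoulli reward setup are illustrative assumptions.

```python
import math
import random


def indices(pulls, means, t):
    """Illustrative sample-mean-based index: mean plus an exploration bonus.

    The sqrt(2 log t / n_j) bonus is an assumed stand-in; the paper's indices
    are tuned so that the regret constant relates to the Kullback-Leibler number.
    """
    return [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, pulls)]


def run_bandit(arm_probs, horizon):
    """Play Bernoulli arms with the index policy, pulling each arm once first."""
    k = len(arm_probs)
    pulls = [0] * k      # number of times each arm has been played
    means = [0.0] * k    # running sample mean of each arm's rewards
    for t in range(1, horizon + 1):
        if t <= k:
            j = t - 1    # initial round-robin pass over all arms
        else:
            idx = indices(pulls, means, t)
            j = max(range(k), key=lambda a: idx[a])
        reward = 1.0 if random.random() < arm_probs[j] else 0.0
        pulls[j] += 1
        means[j] += (reward - means[j]) / pulls[j]  # incremental sample mean update
    return pulls, means


if __name__ == "__main__":
    pulls, means = run_bandit([0.3, 0.5, 0.7], horizon=10_000)
    print("pulls per arm:", pulls)
    print("sample means:", [round(m, 3) for m in means])
```

Because the index depends on each arm's history only through its pull count and sample mean, the per-step computation stays constant in the horizon, which is the computational simplicity the abstract emphasizes.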