Finite-Time Regret Bounds for the Multiarmed Bandit Problem

[1]  G. Lugosi, et al. Minimax lower bounds for the two-armed bandit problem, 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[2]  Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res.

[3]  R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, 1995, Advances in Applied Probability.

[4]  Nicolò Cesa-Bianchi, et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem, 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[5]  Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[6]  Michael O. Duff, et al. Q-Learning for Bandit Problems, 1995, ICML.

[7]  Andrew G. Barto, et al. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.

[8]  P. Varaiya, et al. Multi-armed bandit problem revisited, 1994.

[9]  Leslie Pack Kaelbling, et al. Learning in embedded systems, 1993.

[10]  Bruce E. Hajek, et al. Cooling Schedules for Optimal Annealing, 1988, Math. Oper. Res.

[11]  H. Robbins, et al. Asymptotically efficient adaptive allocation rules, 1985.

[12]  V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm, 1985.

[13]  C. D. Gelatt, et al. Optimization by Simulated Annealing, 1983, Science.

[14]  J. Neveu, et al. Discrete Parameter Martingales, 1975.

[15]  W. Hoeffding. Probability inequalities for sums of bounded random variables, 1963.