Finite-Time Regret Bounds for the Multiarmed Bandit Problem
[1] G. Lugosi,et al. Minimax lower bounds for the two-armed bandit problem , 1997, Proceedings of the 36th IEEE Conference on Decision and Control.
[2] Andrew W. Moore,et al. Reinforcement Learning: A Survey , 1996, J. Artif. Intell. Res.
[3] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem , 1995, Advances in Applied Probability.
[4] Nicolò Cesa-Bianchi,et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem , 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.
[5] Ben J. A. Kröse,et al. Learning from delayed rewards , 1995, Robotics Auton. Syst.
[6] Michael O. Duff,et al. Q-Learning for Bandit Problems , 1995, ICML.
[7] Andrew G. Barto,et al. Learning to Act Using Real-Time Dynamic Programming , 1995, Artif. Intell.
[8] P. Varaiya,et al. Multi-Armed bandit problem revisited , 1994 .
[9] Leslie Pack Kaelbling,et al. Learning in embedded systems , 1993 .
[10] Bruce E. Hajek,et al. Cooling Schedules for Optimal Annealing , 1988, Math. Oper. Res.
[11] H. Robbins,et al. Asymptotically efficient adaptive allocation rules , 1985 .
[12] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm , 1985 .
[13] C. D. Gelatt,et al. Optimization by Simulated Annealing , 1983, Science.
[14] J. Neveu,et al. Discrete Parameter Martingales , 1975 .
[15] W. Hoeffding. Probability inequalities for sums of bounded random variables , 1963 .