Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
[1] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables , 1963 .
[2] J. MacQueen. A Modified Dynamic Programming Method for Markovian Decision Problems , 1966 .
[3] Leslie G. Valiant,et al. Fast probabilistic algorithms for hamiltonian circuits and matchings , 1977, STOC '77.
[4] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .
[5] Nicolò Cesa-Bianchi,et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem , 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.
[6] Ben J. A. Kröse,et al. Learning from delayed rewards , 1995, Robotics Auton. Syst..
[7] Anders R. Kristensen,et al. Dynamic programming and Markov decision processes , 1996 .
[8] John N. Tsitsiklis,et al. Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.
[9] Andrew G. Barto,et al. Reinforcement learning , 1998 .
[10] Michael Kearns,et al. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms , 1998, NIPS.
[11] Sean P. Meyn,et al. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning , 2000, SIAM J. Control. Optim..
[12] Boaz Patt-Shamir,et al. Buffer overflow management in QoS switches , 2001, STOC '01.
[13] Y. Freund,et al. The non-stochastic multi-armed bandit problem , 2001 .
[14] Yishay Mansour,et al. Learning Rates for Q-learning , 2004, J. Mach. Learn. Res..
[15] Peter Auer,et al. The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..
[16] John Langford,et al. Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.
[17] John N. Tsitsiklis,et al. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem , 2004, J. Mach. Learn. Res..
[18] Satinder Singh,et al. An upper bound on the loss from approximate optimal-value functions , 1994, Machine Learning.
[19] Michael Kearns,et al. Near-Optimal Reinforcement Learning in Polynomial Time , 1998, Machine Learning.
[20] Yishay Mansour,et al. Competitive queue policies for differentiated services , 2000, Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies (Cat. No.00CH37064).
[21] Csaba Szepesvári,et al. Finite time bounds for sampling based fitted value iteration , 2005, ICML.
[22] Liming Xiang,et al. Kernel-Based Reinforcement Learning , 2006, ICIC.
[23] H. Robbins. Some aspects of the sequential design of experiments , 1952 .
[24] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules , 1985 .