Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences
[1] Dimitri P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models, 1987.
[2] C. Watkins. Learning from delayed rewards, 1989.
[3] Sebastian Thrun. Efficient Exploration in Reinforcement Learning, 1992.
[4] Mahesan Niranjan, et al. On-line Q-learning using connectionist systems, 1994.
[5] Junichiro Yoshimoto, et al. Control of exploitation-exploration meta-parameter in reinforcement learning, 2002, Neural Networks.
[6] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.
[7] Baruch Awerbuch, et al. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches, 2004, STOC '04.
[8] Rina Azoulay-Schwartz, et al. Exploitation vs. exploration: choosing a supplier in an environment of incomplete information, 2004, Decis. Support Syst.
[9] Mehryar Mohri, et al. Multi-armed Bandit Algorithms and Empirical Evaluation, 2005, ECML.
[10] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction, 1998, MIT Press.
[11] Warren B. Powell, et al. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, 2006, Machine Learning.
[12] H. Robbins. Some aspects of the sequential design of experiments, 1952.
[13] Gianluca Bontempi, et al. Improving the Exploration Strategy in Bandit Algorithms, 2008, LION.
[14] Verena Heidrich-Meisner. Interview with Richard S. Sutton, 2009, Künstliche Intelligenz.