Balancing exploration and exploitation in reinforcement learning using a value of information criterion
[1] P. Dayan, et al. Exploration bonuses and dual control, 1996.
[2] L. Goddard. Information Theory, 1962, Nature.
[3] Yang Liu, et al. A new Q-learning algorithm based on the Metropolis criterion, 2004, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).
[4] Satinder Singh. Transfer of Learning by Composing Solutions of Elemental Sequential Tasks, 1992, Mach. Learn..
[5] Jürgen Schmidhuber, et al. Fast Online Q(λ), 1998, Machine Learning.
[6] C. Atkeson, et al. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time, 1993.
[7] Djallel Bouneffouf, et al. Finite-time analysis of the multi-armed bandit problem with known trend, 2016, 2016 IEEE Congress on Evolutionary Computation (CEC).
[8] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res..
[9] Nicky J. Welton, et al. Value of Information, 2015, Medical Decision Making: An International Journal of the Society for Medical Decision Making.
[10] David H. Wolpert, et al. Bandit problems and the exploration/exploitation tradeoff, 1998, IEEE Trans. Evol. Comput..
[11] Peter Vrancx, et al. Reinforcement Learning: State-of-the-Art, 2012.
[12] Long-Ji Lin. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching, 1992, Machine Learning.
[13] Peter Auer, et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs, 2003, J. Mach. Learn. Res..
[14] John N. Tsitsiklis, et al. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem, 2004, J. Mach. Learn. Res..
[15] Sebastian Thrun, et al. Active Exploration in Dynamic Environments, 1991, NIPS.
[16] Marco Wiering, et al. Reinforcement Learning, 2014, Adaptation, Learning, and Optimization.
[17] Tommi S. Jaakkola, et al. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms, 2000, Machine Learning.
[18] Thomas M. Cover, et al. Elements of Information Theory, 2005.
[19] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[20] Jie Chen, et al. Optimal Contraction Theorem for Exploration–Exploitation Tradeoff in Search and Optimization, 2009, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.
[21] G. Tesauro. Practical Issues in Temporal Difference Learning, 1992.
[22] Andrew W. Moore, et al. Generalization in Reinforcement Learning: Safely Approximating the Value Function, 1994, NIPS.
[23] Peter Dayan, et al. Q-learning, 1992, Machine Learning.
[24] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.
[25] Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res..