Prioritized Sweeping Converges to the Optimal Value Function

Prioritized sweeping (PS) and its variants are model-based reinforcement-learning algorithms that have demonstrated superior computational and experience efficiency in practice. This note establishes what is, to the best of our knowledge, the first formal proof of convergence to the optimal value function when these algorithms are used for planning. We also describe applications of this result to provably efficient model-based reinforcement learning in the PAC-MDP framework. We do not address convergence rates in the present paper.
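For orientation, the following is a minimal sketch of prioritized sweeping used as a planning algorithm on a known MDP model, in the spirit of Moore and Atkeson's original formulation: Bellman backups are ordered by a priority queue keyed on Bellman error, so that the states whose values are most out of date are updated first. The model interface (`P`, `R`), the tolerance `theta`, and the backup budget are illustrative assumptions, not details taken from the paper.

```python
import heapq
import itertools
from collections import defaultdict

def prioritized_sweeping(S, A, P, R, gamma=0.95, theta=1e-6, max_backups=100_000):
    """Plan on a known MDP by prioritized Bellman backups (illustrative sketch).

    S: iterable of states; A: iterable of actions.
    P[s][a]: list of (next_state, prob) pairs; R[s][a]: expected reward.
    Returns an approximation of the optimal state-value function V*.
    """
    V = {s: 0.0 for s in S}

    # Predecessor map: which states can transition into s' under some action.
    preds = defaultdict(set)
    for s in S:
        for a in A:
            for s2, p in P[s][a]:
                if p > 0:
                    preds[s2].add(s)

    def bellman(s):
        # Bellman optimality backup for state s under the known model.
        return max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                   for a in A)

    # Seed the queue with every state's initial Bellman error.
    # heapq is a min-heap, so priorities are negated; the counter breaks ties
    # without requiring states themselves to be comparable.
    tiebreak = itertools.count()
    heap = []
    for s in S:
        err = abs(bellman(s) - V[s])
        if err > theta:
            heapq.heappush(heap, (-err, next(tiebreak), s))

    for _ in range(max_backups):
        if not heap:
            break  # all Bellman errors are below tolerance
        _, _, s = heapq.heappop(heap)
        new_v = bellman(s)
        if abs(new_v - V[s]) <= theta:
            continue  # stale queue entry; the error has since shrunk
        V[s] = new_v
        # A change at s may raise the Bellman error of its predecessors.
        for sp in preds[s]:
            err = abs(bellman(sp) - V[sp])
            if err > theta:
                heapq.heappush(heap, (-err, next(tiebreak), sp))
    return V
```

Rather than tracking which queue entries are outdated, the sketch simply recomputes the backup on pop and discards entries whose error has fallen below `theta`; this keeps the queue logic simple at the cost of some redundant pushes.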