Efficient Planning in MDPs by Small Backups

Efficient planning plays a crucial role in model-based reinforcement learning. Traditionally, the main planning operation is a full backup based on the current estimates of the successor states. Consequently, its computation time is proportional to the number of successor states. In this paper, we introduce a new planning backup that uses only the current value of a single successor state and has a computation time independent of the number of successor states. This new backup, which we call a small backup, opens the door to a new class of model-based reinforcement learning methods that exhibit much finer control over their planning process than traditional methods. We empirically demonstrate that this increased flexibility allows for more efficient planning by showing that an implementation of prioritized sweeping based on small backups achieves a substantial performance improvement over classical implementations.
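To make the distinction concrete, below is a minimal sketch in Python of a classical full backup versus a single-successor update in the spirit of a small backup. All identifiers (P, R, V, Q, U, gamma) and the caching scheme are assumptions made for illustration; this is not the paper's exact small-backup algorithm or its integration with prioritized sweeping.

```python
import numpy as np

# Toy 2-state, 1-action MDP; all names (P, R, V, Q, U, gamma) are
# illustrative assumptions, not identifiers from the paper.
n_states, n_actions = 2, 1
P = np.array([[[0.5, 0.5]],                    # P[s, a, s'] transition probabilities
              [[0.0, 1.0]]])
R = np.ones((n_states, n_actions, n_states))   # R[s, a, s'] expected rewards
gamma = 0.9
V = np.zeros(n_states)                         # current state-value estimates
U = np.zeros((n_states, n_actions, n_states))  # cached successor values used so far
Q = (P * R).sum(axis=2)                        # action values, consistent with U == 0

def full_backup(s, a):
    """Classical full backup: sums over ALL successors of (s, a),
    so its cost grows with the number of successor states."""
    return float(np.sum(P[s, a] * (R[s, a] + gamma * V)))

def small_style_backup(s, a, s2):
    """Single-successor update: refreshes only the contribution of
    successor s2 to Q[s, a], using s2's current value V[s2].
    The cached value U[s, a, s2] makes the correction O(1),
    independent of how many successors (s, a) has."""
    Q[s, a] += gamma * P[s, a, s2] * (V[s2] - U[s, a, s2])
    U[s, a, s2] = V[s2]
    return Q[s, a]

# Example: after V[1] changes, only state 1's contribution needs correcting.
V[1] = 2.0
print(full_backup(0, 0))            # 1.9: recomputed from scratch over all successors
print(small_style_backup(0, 0, 1))  # 1.9: same result via a constant-time correction
```

The point of the sketch is that the single-successor update touches only one cached value, which is what makes a per-backup cost independent of the number of successor states possible.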
