Integrating Sample-Based Planning and Model-Based Reinforcement Learning

Recent advances in model-based reinforcement learning have shown that the dynamics of many structured domains (e.g., those described by dynamic Bayesian networks, or DBNs) can be learned with tractable sample complexity despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes a near-optimal policy, and while many traditional MDP algorithms make this guarantee, their computation time grows with the number of states. We show how to replace these over-matched planners with a class of sample-based planners, whose computation time is independent of the number of states, without sacrificing the sample-efficiency guarantees of the overall learning algorithms. To do so, we define sufficient criteria for a sample-based planner to be used in such a learning system and analyze two popular sample-based approaches from the literature. We also introduce our own sample-based planner, which combines the strategies of these algorithms and still meets the criteria for integration into our learning system. The result is the first complete RL solution for compactly represented (exponentially sized) state spaces with efficiently learnable dynamics that is both sample efficient and whose computation time does not grow rapidly with the number of states.

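To illustrate why this class of planners scales independently of the number of states, here is a minimal sketch in the spirit of sparse sampling, one of the well-known sample-based planners in this line of work. It is not the paper's algorithm or interface: the `model`, `plan`, and parameter names are illustrative assumptions, and the only thing the planner requires is a generative model that returns sampled next states and rewards.

```python
def sparse_sampling_q(model, state, actions, depth, width, gamma):
    """Estimate Q(state, a) for every action by recursive sparse sampling.

    `model(state, action)` is assumed to return one sampled
    (next_state, reward) pair from a (possibly learned) generative model.
    The planner never enumerates the state space; it only draws samples.
    """
    if depth == 0:
        return {a: 0.0 for a in actions}

    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):  # draw `width` sampled successors per action
            next_state, reward = model(state, a)
            next_q = sparse_sampling_q(model, next_state, actions,
                                       depth - 1, width, gamma)
            total += reward + gamma * max(next_q.values())
        q[a] = total / width
    return q


def plan(model, state, actions, depth=3, width=5, gamma=0.95):
    """Return an action that is greedy with respect to the sampled Q estimates."""
    q = sparse_sampling_q(model, state, actions, depth, width, gamma)
    return max(q, key=q.get)


if __name__ == "__main__":
    import random

    def toy_model(state, action):
        # Hypothetical toy generative model: action 1 usually advances the
        # state and earns a reward; action 0 does nothing.
        moved = action == 1 and random.random() < 0.8
        next_state = state + 1 if moved else state
        reward = 1.0 if moved else 0.0
        return next_state, reward

    print(plan(toy_model, state=0, actions=[0, 1]))
```

The total number of model calls is (width × |A|)^depth, which depends only on the sampling width, the number of actions, and the planning horizon, never on the size of the state space. That is the property the abstract relies on when pairing such a planner with a sample-efficient model learner over a compactly represented domain.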