PAC-inspired Option Discovery in Lifelong Reinforcement Learning

A key goal of AI is to create lifelong learning agents that can leverage prior experience to improve performance on later tasks. In reinforcement learning, one way to summarize prior experience for future use is through options: temporally extended actions (subpolicies) that prescribe how to behave. Such options can then be reused to potentially accelerate learning in new reinforcement learning tasks. In this work, we provide the first formal analysis of the sample complexity (a measure of learning speed) of reinforcement learning with options. This analysis helps shed light on prior empirical results about when and how options may accelerate learning. We then quantify the benefit of options in reducing the sample complexity of a lifelong learning agent. Finally, these theoretical insights inspire a novel option-discovery algorithm that aims to minimize overall sample complexity in lifelong reinforcement learning.
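To make the notion of an option concrete, the sketch below shows one common way to represent an option in the standard options framework (an initiation set, an intra-option policy, and a termination condition). This is an illustrative Python sketch, not code from the paper; the `Option` class, the `env.step(s, a)` interface, and all names here are assumptions introduced only for clarity.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Set, Tuple

State = Hashable
Action = Hashable


@dataclass
class Option:
    """A temporally extended action: an initiation set I, an intra-option
    policy pi, and a termination probability beta(s)."""
    initiation_set: Set[State]             # states where the option may be invoked
    policy: Dict[State, Action]            # deterministic intra-option policy pi(s)
    termination: Callable[[State], float]  # beta(s): probability of terminating in s

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set


def run_option(env, option: Option, s: State,
               rng: random.Random) -> Tuple[State, float, int]:
    """Execute an option until it terminates.

    Assumes a hypothetical environment with env.step(s, a) -> (next_state,
    reward, done). Returns the final state, the cumulative (undiscounted)
    reward collected, and the number of primitive steps taken.
    """
    total_reward, steps = 0.0, 0
    while True:
        a = option.policy[s]
        s, r, done = env.step(s, a)
        total_reward += r
        steps += 1
        if done or rng.random() < option.termination(s):
            return s, total_reward, steps
```

Because invoking a single option can consume many primitive steps, reusing good options across tasks is what the paper's sample-complexity analysis aims to quantify.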
