The Expected-Length Model of Options

Effective options can make reinforcement learning easier by enhancing an agent's ability both to explore in a targeted manner and to plan further into the future. However, learning an appropriate model of an option's dynamics is hard, requiring the estimation of a highly parameterized probability distribution. This paper introduces and motivates the Expected-Length Model (ELM) of options, an alternative model of option transition dynamics. We prove that ELM is a (biased) estimator of the traditional Multi-Time Model (MTM), but provide a non-vacuous bound on their deviation. We further prove that, in stochastic shortest-path problems, ELM induces a value function sufficiently similar to the one induced by MTM, and is thus capable of supporting near-optimal behavior. We explore the practical utility of this option model experimentally, finding consistent support for the thesis that ELM is a suitable replacement for MTM. In some cases, we find that ELM leads to more sample-efficient learning, especially when options are arranged in a hierarchy.
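
To make the contrast between the two models concrete, the Python sketch below estimates both from sampled option executions and applies an option-level Bellman backup. It is a minimal illustration under assumed definitions, not the authors' implementation; the names (Execution, estimate_mtm, estimate_elm, backup) are hypothetical. The key difference: MTM discounts each termination outcome by gamma raised to that rollout's actual duration, whereas ELM discounts the estimated termination-state distribution once by gamma raised to the option's expected duration.

    # Sketch only: contrasting MTM and ELM transition models for a single option.
    from collections import defaultdict
    from dataclasses import dataclass
    from typing import Dict, Hashable, List

    State = Hashable

    @dataclass
    class Execution:
        """One rollout of the option from a fixed start state."""
        length: int       # number of primitive steps before termination
        end_state: State  # state in which the option terminated

    def estimate_mtm(rollouts: List[Execution], gamma: float) -> Dict[State, float]:
        """MTM-style transition model: each observed outcome is weighted by
        gamma**length, so the entries sum to less than one and encode duration."""
        model = defaultdict(float)
        for ex in rollouts:
            model[ex.end_state] += (gamma ** ex.length) / len(rollouts)
        return dict(model)

    def estimate_elm(rollouts: List[Execution], gamma: float) -> Dict[State, float]:
        """ELM-style transition model: estimate the termination-state distribution
        and the expected length separately, then discount once by gamma**(mean length)."""
        dist = defaultdict(float)
        for ex in rollouts:
            dist[ex.end_state] += 1.0 / len(rollouts)
        mean_len = sum(ex.length for ex in rollouts) / len(rollouts)
        return {s: (gamma ** mean_len) * p for s, p in dist.items()}

    def backup(model: Dict[State, float], reward: float, values: Dict[State, float]) -> float:
        """One option-level Bellman backup; discounting is already baked into the model."""
        return reward + sum(p * values[s] for s, p in model.items())

    if __name__ == "__main__":
        rollouts = [Execution(3, "A"), Execution(5, "A"), Execution(4, "B")]
        values = {"A": 1.0, "B": 0.5}
        gamma = 0.95
        print("MTM backup:", backup(estimate_mtm(rollouts, gamma), 0.1, values))
        print("ELM backup:", backup(estimate_elm(rollouts, gamma), 0.1, values))

In this toy usage, ELM needs only a termination distribution and a single expected length per (state, option) pair, rather than a joint distribution over termination states and durations, which is the source of its statistical advantage and of the bias bounded in the paper.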
