Abstract Value Iteration for Hierarchical Reinforcement Learning

We propose a novel hierarchical reinforcement learning framework for control with continuous state and action spaces. In our framework, the user specifies subgoal regions, which are subsets of the state space; then, we (i) learn options that serve as transitions between these subgoal regions, and (ii) construct a high-level plan in the resulting abstract decision process (ADP). A key challenge is that the ADP may not be Markov, which we address by proposing two algorithms for planning in the ADP. Our first algorithm is conservative, allowing us to prove theoretical guarantees on its performance, which in turn inform the design of the subgoal regions. Our second algorithm is a practical one that interweaves planning at the abstract level and learning at the concrete level. In our experiments, we demonstrate that our approach outperforms state-of-the-art hierarchical reinforcement learning algorithms on several challenging benchmarks.
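To make the planning step concrete, below is a minimal sketch of value iteration over an abstract decision process whose states are subgoal regions and whose actions are learned options. It treats the ADP as if it were Markov, so it does not capture the paper's conservative or interleaved planning algorithms; the function name `abstract_value_iteration` and the arrays `P` and `R` are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def abstract_value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Value iteration over an abstract decision process (ADP).

    P: (num_regions, num_options, num_regions) array of estimated
       transition probabilities between subgoal regions under each option.
    R: (num_regions, num_options) array of estimated option rewards.
    Returns the abstract value function V and a greedy high-level plan.
    """
    num_regions, num_options, _ = P.shape
    V = np.zeros(num_regions)
    while True:
        # Q[s, o] = estimated reward of running option o from region s,
        # plus the discounted value of the region it transitions to.
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    plan = Q.argmax(axis=1)  # which option to execute in each subgoal region
    return V, plan

# Toy example: 3 subgoal regions, 2 options available in each region.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))  # row-stochastic abstract transitions
R = rng.uniform(0, 1, size=(3, 2))          # estimated option rewards
V, plan = abstract_value_iteration(P, R)
print("abstract values:", V)
print("high-level plan:", plan)
```

In practice the transition and reward estimates for each option would come from the concrete-level learning step, and the paper's algorithms additionally account for the fact that these estimates need not define a Markov process over regions.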
