Reinforcement Learning in POMDPs with Memoryless Options and Option-Observation Initiation Sets

Many real-world reinforcement learning problems have a hierarchical nature, and often exhibit some degree of partial observability. While hierarchy and partial observability are usually tackled separately (for instance by combining recurrent neural networks and options), we show that addressing both problems simultaneously is simpler and more efficient in many cases. More specifically, we make the initiation set of each option conditional on the previously executed option, and show that options with such Option-Observation Initiation Sets (OOIs) are at least as expressive as Finite State Controllers (FSCs), a state-of-the-art approach for learning in POMDPs. OOIs are easy to design from an intuitive description of the task, lead to explainable policies, and keep both the top-level and option policies memoryless. Our experiments show that OOIs allow agents to learn optimal policies in challenging POMDPs, while being much more sample-efficient than a recurrent neural network over options.
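The core idea can be illustrated concretely: each option carries an initiation set over previously executed options, so the memoryless top-level policy can only start options whose initiation set contains the option that just terminated. Below is a minimal sketch of this mechanism; it is not the authors' implementation, and the option names, toy observation space, and random top-level choice are illustrative assumptions only.

```python
import random

class Option:
    """An option with a memoryless policy and an Option-Observation Initiation Set."""
    def __init__(self, name, initiation_set, policy, termination):
        self.name = name
        self.initiation_set = initiation_set  # names of options after which this option may start
        self.policy = policy                  # memoryless: observation -> primitive action
        self.termination = termination        # memoryless: observation -> termination probability

def available_options(options, previous_option):
    # The OOI restricts which options the top-level policy may initiate,
    # conditioned only on the previously executed option.
    return [o for o in options if previous_option in o.initiation_set]

# Two hypothetical options: after "go-left" only "go-right" may start, and vice versa,
# so the otherwise memoryless hierarchy effectively encodes one bit of memory.
go_left = Option("go-left", {"start", "go-right"},
                 policy=lambda obs: "left", termination=lambda obs: 0.1)
go_right = Option("go-right", {"start", "go-left"},
                  policy=lambda obs: "right", termination=lambda obs: 0.1)

previous = "start"
for _ in range(3):
    candidates = available_options([go_left, go_right], previous)
    chosen = random.choice(candidates)  # a learned top-level policy would choose here
    print("executing", chosen.name)
    previous = chosen.name
```

Because the restriction depends only on the last executed option, both the top-level and option policies remain memoryless while still expressing behaviors that would otherwise require an explicit memory, such as an FSC or a recurrent network.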
