A Deep Hierarchical Reinforcement Learning Algorithm in Partially Observable Markov Decision Processes

In recent years, reinforcement learning (RL) has achieved remarkable success, driven by the growing adoption of deep learning techniques and the rapid growth of computing power. Nevertheless, it is well known that flat RL algorithms often struggle to learn, and are data-inefficient, on tasks with hierarchical structure, e.g., tasks composed of multiple subtasks. Hierarchical reinforcement learning is a principled approach to tackling such challenging tasks. At the same time, many real-world tasks are only partially observable: state measurements are imperfect and incomplete. RL in such settings can be formulated as a partially observable Markov decision process (POMDP). In this paper, we study hierarchical RL in POMDPs, i.e., in tasks that are both partially observable and hierarchically structured. We propose a deep hierarchical reinforcement learning algorithm for such hierarchical POMDPs that is applicable to both MDP and POMDP domains. We evaluate the proposed algorithm on a variety of challenging hierarchical POMDPs.
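The abstract leaves the architecture unspecified. As a rough illustration of how a hierarchical agent can be combined with partial observability, the sketch below follows a two-level decomposition in the spirit of h-DQN (a meta-controller that selects subgoals, a controller that executes them) paired with a recurrent controller in the style of DRQN, whose LSTM state summarizes the observation history in place of the unobservable state. All class names, layer sizes, and the PyTorch framing are illustrative assumptions, not the paper's actual method.

import torch
import torch.nn as nn

class MetaController(nn.Module):
    # Q-network over subgoals, queried at the slower timescale
    # (illustrative assumption: subgoals form a small discrete set).
    def __init__(self, obs_dim, n_subgoals, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_subgoals))

    def forward(self, obs):
        return self.net(obs)  # one Q-value per candidate subgoal

class RecurrentController(nn.Module):
    # Q-network over primitive actions, conditioned on the active subgoal.
    # The LSTM hidden state stands in for a belief over the unobserved
    # POMDP state, as in recurrent Q-learning approaches.
    def __init__(self, obs_dim, n_subgoals, n_actions, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + n_subgoals, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, subgoal_onehot, state=None):
        # obs_seq: (batch, time, obs_dim); the subgoal stays fixed over the
        # segment, so it is broadcast along the time axis.
        g = subgoal_onehot.unsqueeze(1).expand(-1, obs_seq.size(1), -1)
        out, state = self.lstm(torch.cat([obs_seq, g], dim=-1), state)
        return self.head(out), state  # per-step Q-values and recurrent state

# Illustrative shapes only: 10-dim observations, 4 subgoals, 5 actions.
meta = MetaController(obs_dim=10, n_subgoals=4)
ctrl = RecurrentController(obs_dim=10, n_subgoals=4, n_actions=5)

Under such a scheme, the meta-controller would typically be trained on the extrinsic return accumulated while a subgoal is active, and the controller on an intrinsic reward for reaching that subgoal; the recurrent state carries across time steps the information that a single observation omits.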
