Hierarchical Imitation and Reinforcement Learning

We study how to effectively leverage expert feedback to learn sequential decision-making policies. We focus on problems with sparse rewards and long time horizons, which typically pose significant challenges in reinforcement learning. We propose an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem to integrate different modes of expert interaction. Our framework can incorporate different combinations of imitation learning (IL) and reinforcement learning (RL) at different levels, leading to dramatic reductions in both expert effort and cost of exploration. Using long-horizon benchmarks, including Montezuma's Revenge, we demonstrate that our approach can learn significantly faster than hierarchical RL, and be significantly more label-efficient than standard IL. We also theoretically analyze labeling cost for certain instantiations of our framework.
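As a rough illustration of the idea, below is a minimal, self-contained sketch of one possible instantiation of hierarchical guidance: a high-level policy trained from expert subgoal labels (IL) while low-level subpolicies are trained by Q-learning on a subgoal-completion pseudo-reward (RL). The toy chain environment, the subgoal set, and all names (`expert_subgoal`, `hi_policy`, `lo_policy`) are hypothetical assumptions for illustration, and the on-demand labeling rule is a simplification; this is not the paper's exact algorithm or benchmark setup.

```python
# Sketch of hierarchical guidance (assumed instantiation, not the paper's
# exact method): the expert is queried ONLY for high-level subgoal labels,
# while each low-level subpolicy learns by RL from a pseudo-reward for
# reaching its assigned subgoal. Toy chain environment: states 0..N-1,
# true task reward only at the final state.
import random
from collections import defaultdict

N = 12                 # chain length; episode ends at state N-1
SUBGOALS = [3, 7, 11]  # hypothetical subgoal states the high level can pick
ACTIONS = [-1, 1]      # low-level primitive moves: left / right

def expert_subgoal(state):
    """Hypothetical high-level expert: the next subgoal to the right."""
    return next(g for g in SUBGOALS if g > state)

hi_dataset = {}        # state -> expert subgoal label (IL training data)

def hi_policy(state):
    """Learned high-level policy: lookup table with random fallback."""
    return hi_dataset.get(state, random.choice(SUBGOALS))

Q = defaultdict(float)  # (subgoal, state, action) -> Q-value

def lo_policy(subgoal, state, eps=0.1):
    """Epsilon-greedy low-level subpolicy for a given subgoal."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(subgoal, state, a)])

for episode in range(500):
    state = 0
    while state < N - 1:
        g = hi_policy(state)
        # Hierarchical guidance: expert effort is spent only on
        # high-level labels, queried on demand for unlabeled states.
        if state not in hi_dataset:
            hi_dataset[state] = expert_subgoal(state)
            g = hi_dataset[state]
        # Run the low-level subpolicy toward g with Q-learning updates.
        for _ in range(2 * N):  # horizon cap for the subtask
            a = lo_policy(g, state)
            nxt = min(max(state + a, 0), N - 1)
            done = (nxt == g)
            pseudo_r = 1.0 if done else 0.0  # subgoal-completion reward
            target = pseudo_r if done else 0.9 * max(
                Q[(g, nxt, b)] for b in ACTIONS)
            Q[(g, state, a)] += 0.5 * (target - Q[(g, state, a)])
            state = nxt
            if done:
                break

print("high-level expert labels used:", len(hi_dataset))
```

The point of the sketch is the accounting: expert cost is measured in high-level labels only (here, one per visited state), while all low-level learning proceeds by autonomous exploration over short subgoal horizons, which is the intuition behind the label-efficiency and exploration savings claimed above.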
