Abstraction from demonstration for efficient reinforcement learning in high-dimensional domains

Reinforcement learning (RL) and learning from demonstration (LfD) are two popular families of algorithms for learning policies for sequential decision problems, but they are often ineffective in high-dimensional domains unless provided with either a great deal of problem-specific domain information or a carefully crafted representation of the state and dynamics of the world. We introduce new approaches inspired by these two techniques, which we broadly call abstraction from demonstration. Our first algorithm, state abstraction from demonstration (AfD), uses a small set of human demonstrations of the target task to determine a state-space abstraction. Our second algorithm, abstraction and decomposition from demonstration (ADA), additionally determines a task decomposition from the same demonstrations. These abstractions allow RL to scale up to higher-complexity domains, and they offer much better performance than LfD while requiring orders of magnitude fewer demonstrations. Using a set of videogame-like domains, we show that abstraction from demonstration can yield up to exponential speed-ups over table-based representations, and polynomial speed-ups compared with function-approximation-based RL algorithms such as fitted Q-learning and LSPI.
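The abstract does not fix a particular mechanism for extracting the abstraction, so the following is only a minimal sketch of the general idea: treat the demonstrations as a supervised dataset mapping states to demonstrated actions, keep the features most useful for predicting the action, and run ordinary RL over the reduced state. The classifier choice, the `select_features` and `abstract_state` helpers, and the `keep` threshold are illustrative assumptions, not the authors' implementation.

```python
# Sketch: derive a state-space abstraction from demonstrations, then use it for RL.
# Assumptions (not from the paper): demonstrations are (state, action) pairs with
# numeric state features; feature relevance is scored with a decision-tree
# classifier; the retained features define the state used by tabular Q-learning.

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def select_features(demo_states, demo_actions, keep=4):
    """Treat demonstrations as a classification problem (state -> action) and
    keep the features that matter most for predicting the demonstrated action."""
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(demo_states, demo_actions)
    # Rank features by how heavily the classifier relies on them.
    ranked = np.argsort(clf.feature_importances_)[::-1]
    return ranked[:keep]


def abstract_state(state, selected):
    """Project a full state vector onto the selected (abstract) features."""
    return tuple(np.asarray(state)[selected])


# Usage sketch: learn the abstraction from a handful of demonstrations, then run
# any standard RL algorithm (e.g. tabular Q-learning) over the abstract states:
#   selected = select_features(demo_states, demo_actions)
#   s_abs = abstract_state(s, selected)
#   q[s_abs][a] += alpha * (r + gamma * max(q[abstract_state(s2, selected)]) - q[s_abs][a])
```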
