Identifying Reusable Macros for Efficient Exploration via Policy Compression

Reinforcement learning agents often need to solve not a single task but several tasks pertaining to the same domain; in particular, each task corresponds to an MDP drawn from a family of related MDPs (a domain). An agent learning in this setting should be able to exploit policies it has learned in the past, on a given set of sample tasks, in order to more rapidly acquire policies for novel tasks. Consider, for instance, a navigation problem where an agent may have to learn to navigate different (but related) mazes. Even though these correspond to distinct tasks (since the goal and starting locations of the agent may change, as well as the maze configuration itself), their solutions share common properties; for example, in order to reach distant areas of the maze, an agent should not move in circles. After an agent has learned to solve a few sample tasks, it may be possible to leverage the acquired experience to facilitate solving novel tasks from the same domain. Our work is motivated by the observation that trajectory samples from optimal policies for tasks belonging to a common domain often reveal useful underlying patterns for solving novel tasks. We propose an optimization objective that characterizes the problem of learning reusable temporally extended actions (macros). We then introduce a computationally tractable surrogate objective that is equivalent to finding macros that allow for maximal compression of a given set of sampled trajectories. We develop a compression-based approach for obtaining such macros and propose an exploration strategy that takes advantage of them. We show that meaningful behavioral patterns can be identified from sample policies over both discrete and continuous action spaces, and present evidence that the proposed exploration strategy improves learning time on novel tasks.
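To make the compression intuition concrete, the following is a minimal, hypothetical sketch, not the authors' algorithm: the function `candidate_macros`, its gain heuristic, and the example trajectories are all illustrative assumptions. It ranks contiguous action subsequences by roughly how many symbols would be saved if every repeated occurrence were replaced with a single macro token, a crude proxy for the kind of compression gain the surrogate objective targets.

```python
# Hypothetical sketch: rank contiguous action subsequences in sampled
# trajectories by an approximate compression gain (symbols saved if each
# repeat of the subsequence were encoded as one macro token).
from collections import Counter

def candidate_macros(trajectories, min_len=2, max_len=4, top_k=3):
    """trajectories: list of action sequences (lists of hashable primitive actions)."""
    counts = Counter()
    for actions in trajectories:
        for n in range(min_len, max_len + 1):
            for i in range(len(actions) - n + 1):
                counts[tuple(actions[i:i + n])] += 1
    # Approximate symbols saved by encoding every repeat of macro m as one token.
    gain = {m: (len(m) - 1) * (c - 1) for m, c in counts.items() if c > 1}
    return sorted(gain, key=gain.get, reverse=True)[:top_k]

# Example: two maze trajectories over primitive moves; repeated "up" runs and
# "up-right" corners surface as reusable macro candidates.
trajs = [list("UUURRRUUURRR"), list("UUURRUUURR")]
print(candidate_macros(trajs))
```

Once such macros are extracted, one plausible way to exploit them during exploration (again, a hedged reading of the abstract rather than the paper's exact strategy) is to occasionally execute a sampled macro open-loop instead of a single random primitive action, so that exploratory behavior reflects the behavioral regularities found in the sample tasks.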
