Probabilistic policy reuse in a reinforcement learning agent

We contribute Policy Reuse, a technique to improve a reinforcement learning agent with guidance from previously learned, similar policies. Our method uses the past policies as a probabilistic bias: at each step the learning agent chooses among exploiting the policy currently being learned, exploring random unexplored actions, and exploiting a past policy. We introduce the algorithm and its major components: an exploration strategy that incorporates the new reuse bias, and a similarity function that estimates how similar past policies are to the new one. We provide empirical results demonstrating that Policy Reuse improves learning performance over several strategies that learn without reuse. Interestingly, and almost as a side effect, Policy Reuse also identifies classes of similar policies, revealing a basis of core policies of the domain. We demonstrate that such a basis can be built incrementally, contributing to the learning of the structure of a domain.
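The paper's exact reuse strategy and parameter schedule are not reproduced here; the following is a minimal, illustrative sketch of the three-way choice the abstract describes, assuming a tabular Q-value setting. All names (psi, epsilon, past_policy, q_values) are illustrative placeholders, not the paper's notation or API.

```python
import random

def reuse_biased_action(q_values, past_policy, state, actions, psi, epsilon):
    """Illustrative action selection mixing the three choices described above:
    with probability psi, reuse the past policy as a probabilistic bias;
    otherwise act epsilon-greedily on the Q-values being learned."""
    if random.random() < psi:
        # Exploit a past policy: take the action it prescribes in this state.
        return past_policy[state]
    if random.random() < epsilon:
        # Explore a random, possibly unexplored action.
        return random.choice(actions)
    # Exploit the ongoing learned policy (greedy on the current Q-values).
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))
```

In the sketch, psi controls how strongly the past policy biases exploration, while epsilon keeps ordinary random exploration available; decaying psi over episodes would gradually shift the agent from reuse toward its own learned policy.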
