Efficient Deep Reinforcement Learning through Policy Transfer

Transfer Learning (TL) has shown great potential to accelerate Reinforcement Learning (RL) by leveraging prior knowledge from policies learned on relevant past tasks. Existing transfer approaches either explicitly compute the similarity between tasks or select appropriate source policies to provide guided exploration for the target task. However, how to directly optimize the target policy by alternately utilizing knowledge from appropriate source policies, without explicitly measuring task similarity, remains an open question. In this paper, we propose a novel Policy Transfer Framework (PTF) that accelerates RL by exploiting this idea. Our framework learns when and which source policy is best to reuse for the target policy, and when to terminate it, by modeling multi-policy transfer as an option learning problem. PTF can be easily combined with existing deep RL approaches. Experimental results show that it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods in terms of learning efficiency and final performance, in both discrete and continuous action spaces.
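The core idea, treating each source policy as an option and learning both which one to reuse and when to terminate it, can be illustrated with a minimal tabular sketch. This is not the paper's actual deep RL implementation: the names (`source_policies`, `Q`, `beta`) and the small tabular setting are illustrative assumptions, and the update shown is a generic one-step intra-option Q-learning rule over options rather than PTF's specific loss.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, N_SOURCE = 6, 3, 2

# Hypothetical fixed source policies: each maps state -> action probabilities.
source_policies = rng.dirichlet(np.ones(N_ACTIONS), size=(N_SOURCE, N_STATES))

# Option values Q(s, o): how good it is to reuse source policy o in state s.
Q = np.zeros((N_STATES, N_SOURCE))
# Termination probabilities beta(s, o): when to stop reusing option o.
beta = np.full((N_STATES, N_SOURCE), 0.5)

alpha, gamma, eps = 0.1, 0.95, 0.1


def select_option(s, current=None):
    """Keep the current source policy unless it terminates; else eps-greedy."""
    if current is not None and rng.random() >= beta[s, current]:
        return current
    if rng.random() < eps:
        return int(rng.integers(N_SOURCE))
    return int(np.argmax(Q[s]))


def act(s, o):
    """Sample an action from the reused source policy o."""
    return int(rng.choice(N_ACTIONS, p=source_policies[o, s]))


def update(s, o, r, s_next):
    """One-step intra-option Q-learning update for the option values."""
    # Value of continuing with option o vs. terminating and switching.
    u = (1 - beta[s_next, o]) * Q[s_next, o] + beta[s_next, o] * Q[s_next].max()
    Q[s, o] += alpha * (r + gamma * u - Q[s, o])
```

In a full PTF-style agent, `Q` and `beta` would be neural networks, and the selected source policy would guide the target policy's update rather than act directly; this sketch only shows the "when and which to reuse, and when to terminate" mechanics.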
