Dual Policy Distillation

Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging deep reinforcement learning tasks. This teacher-student framework requires a well-trained teacher model, which is computationally expensive to obtain. Moreover, the student's performance can be limited by the teacher when the teacher is suboptimal. In light of collaborative learning, we study the feasibility of combining the joint intellectual efforts of student models with diverse perspectives. In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate on the same environment, explore it from different perspectives, and extract knowledge from each other to enhance their learning. The key challenge in developing this dual learning framework is identifying which knowledge from the peer learner is beneficial for contemporary learning-based reinforcement learning algorithms, since it is unclear whether knowledge distilled from an imperfect and noisy peer would help. To address this challenge, we theoretically justify that distilling knowledge from a peer learner leads to policy improvement, and we propose a disadvantageous distillation strategy based on this theoretical result. Experiments on several continuous control tasks show that the proposed framework achieves superior performance with learning-based agents and function approximation, without the use of expensive teacher models.
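The "disadvantageous distillation" idea above can be sketched as follows: each learner imitates its peer only at states where the peer appears to do better, i.e., where the learner itself is at a disadvantage. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, signature, and the use of plain value estimates are assumptions made for clarity.

```python
def dual_distillation_targets(states, value_self, value_peer, policy_peer):
    """Collect (state, peer action) pairs to distill from a peer learner.

    A state is selected only when the peer's value estimate exceeds the
    learner's own, i.e., the learner is 'disadvantaged' there. In the real
    framework each of the two learners runs this against the other, adding
    a distillation term to its loss on the selected states.
    """
    targets = []
    for s in states:
        if value_peer(s) > value_self(s):        # peer looks better at s
            targets.append((s, policy_peer(s)))  # imitate the peer's action
    return targets
```

In a full training loop, both learners would call such a routine symmetrically each iteration, so knowledge flows in both directions rather than from a fixed teacher to a student.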
