Policy Distillation

Abstract: Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.
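
The abstract describes the method only at a high level, so the sketch below illustrates the kind of supervised training step that distilling a DQN teacher into a smaller student could reduce to. This is a minimal sketch, not the paper's implementation: the choice of matching the teacher's temperature-softened action distribution with a KL-divergence loss, the network sizes, the temperature value, and the placeholder data batch are all illustrative assumptions, not details taken from the text above.

```python
# Minimal sketch of one policy-distillation update (PyTorch).
# Assumptions (not from the abstract): the teacher is a pre-trained Q-network
# whose outputs are softened with a temperature tau, and the smaller student
# is trained to match that distribution with a KL-divergence loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ACTIONS = 18   # Atari joystick action set (illustrative)
OBS_DIM = 512      # flattened observation features (illustrative)

def make_net(hidden):
    return nn.Sequential(
        nn.Linear(OBS_DIM, hidden), nn.ReLU(),
        nn.Linear(hidden, NUM_ACTIONS),
    )

teacher = make_net(hidden=1024)   # large expert network; weights assumed already trained
student = make_net(hidden=64)     # much smaller distilled network
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
tau = 0.01                        # low temperature sharpens the teacher's Q-values

def distill_step(observations):
    """One supervised step: student matches the teacher's softened policy."""
    with torch.no_grad():
        teacher_q = teacher(observations)
        target = F.softmax(teacher_q / tau, dim=-1)        # softened teacher policy
    student_logp = F.log_softmax(student(observations), dim=-1)
    loss = F.kl_div(student_logp, target, reduction="batchmean")  # KL(teacher || student)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: in practice the batch would come from the teacher's replay memory;
# here a random tensor stands in for real observations.
batch = torch.randn(32, OBS_DIM)
print(distill_step(batch))
```

In this reading, a low temperature concentrates the target distribution on the actions the expert would actually choose, which is one plausible way to keep a small student focused on expert behaviour rather than on regressing every Q-value exactly.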
