Self-Attentional Credit Assignment for Transfer in Reinforcement Learning

The ability to transfer knowledge to novel environments and tasks is a sensible desideratum for general learning agents. Despite its apparent promise, transfer in RL remains an open and under-explored research area. In this paper, we take a new perspective on transfer: we suggest that the ability to assign credit reveals structural invariants in tasks that can be transferred to make RL more sample-efficient. Our main contribution is SECRET, a novel approach to transfer learning for RL that uses a backward-view credit assignment mechanism based on a self-attentive architecture. Two aspects are key to its generality: credit is learned in a separate, offline supervised process, and only the reward function is modified. Consequently, SECRET can be combined with transfer methods that do not modify the reward function, and it can be used on top of any RL algorithm.
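
The abstract only sketches the mechanism. As a rough illustration (not the paper's exact architecture), the PyTorch snippet below shows one way a causal self-attentive model could be trained offline to predict the per-step reward from past observations and actions, with its attention weights then read backward to redistribute reward along the trajectory. All module names, the three-class reward-sign head, and the shaping scheme are assumptions made for this sketch.

```python
# Hedged sketch of backward-view credit assignment with self-attention.
# Assumption: the credit model is a single causal multi-head attention
# layer trained (offline, supervised) to classify the sign of the reward
# at each timestep; attention weights over earlier steps are reused to
# redistribute observed rewards backward in time.
import torch
import torch.nn as nn


class CreditAssignmentModel(nn.Module):
    def __init__(self, obs_dim, n_actions, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim + n_actions, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 3)  # reward sign: negative / zero / positive

    def forward(self, obs, actions_onehot):
        # obs: (batch, T, obs_dim), actions_onehot: (batch, T, n_actions)
        x = self.embed(torch.cat([obs, actions_onehot], dim=-1))
        T = x.size(1)
        # Causal mask: step t may only attend to steps s <= t.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out, weights = self.attn(x, x, x, attn_mask=causal, need_weights=True)
        # weights[b, t, s] = attention paid at step t to an earlier step s.
        return self.head(out), weights


def shaped_rewards(rewards, attn_weights):
    # One possible shaping scheme: send each observed reward backward to
    # the steps the model attended to when predicting it.
    # rewards: (batch, T), attn_weights: (batch, T, T) -> (batch, T)
    return torch.einsum('bt,bts->bs', rewards, attn_weights)
```

Because shaping only touches the reward signal, the redistributed rewards from such a model could in principle feed any downstream RL algorithm unchanged, which is the property the abstract emphasizes.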
