Successor Features for Transfer in Reinforcement Learning

Transfer in reinforcement learning refers to the notion that generalization should occur not only within a task but also across tasks. We propose a transfer framework for the scenario where the reward function changes between tasks but the environment's dynamics remain the same. Our approach rests on two key ideas: "successor features", a value function representation that decouples the dynamics of the environment from the rewards, and "generalized policy improvement", a generalization of dynamic programming's policy improvement operation that considers a set of policies rather than a single one. Put together, the two ideas lead to an approach that integrates seamlessly within the reinforcement learning framework and allows the free exchange of information across tasks. The proposed method also provides performance guarantees for the transferred policy even before any learning has taken place. We derive two theorems that place our approach on firm theoretical ground and present experiments showing that it successfully promotes transfer in practice, significantly outperforming alternative methods in a sequence of navigation tasks and in the control of a simulated robotic arm.
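To make the two ideas concrete, the sketch below illustrates them in a tabular setting, assuming the reward decomposition used in the paper: one-step rewards factor as r(s, a, s') = φ(s, a, s')·w, with features φ shared across tasks and weights w specific to each task, so that the successor features ψ of a policy satisfy Q(s, a) = ψ(s, a)·w for any task w. The function names, table shapes, and the TD-style update are illustrative assumptions, not the paper's implementation; this is a minimal sketch, not a definitive reference.

```python
import numpy as np

# Minimal tabular sketch of successor features (SFs) + generalized policy
# improvement (GPI). Names and shapes are illustrative, not from the paper.
#
# Assumption: one-step rewards factor as r(s, a, s') = phi(s, a, s') . w,
# where phi is shared across tasks and w is task-specific. The SFs of a
# policy pi, psi_pi(s, a) = E_pi[ sum_t gamma^t phi_t ], then satisfy
# Q_pi(s, a) = psi_pi(s, a) . w for any task weights w.

def gpi_action(psi, w, state):
    """Pick an action by generalized policy improvement (GPI).

    psi   : array (n_policies, n_states, n_actions, d) of successor features,
            one table per previously learned policy.
    w     : array (d,) of reward weights describing the new task.
    state : current state index.

    Returns the action maximizing max_i Q^{pi_i}(state, a) = psi_i(state, a) . w;
    the GPI result guarantees this policy is no worse than any single pi_i.
    """
    q = psi[:, state] @ w          # (n_policies, n_actions)
    return int(q.max(axis=0).argmax())


def sf_td_update(psi_i, state, action, phi_sas, next_state, next_action,
                 gamma=0.95, alpha=0.1):
    """One TD(0)-style update of a single policy's SF table:
    psi(s, a) <- psi(s, a) + alpha * (phi + gamma * psi(s', a') - psi(s, a)).
    """
    target = phi_sas + gamma * psi_i[next_state, next_action]
    psi_i[state, action] += alpha * (target - psi_i[state, action])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_policies, n_states, n_actions, d = 3, 5, 4, 2
    psi = rng.random((n_policies, n_states, n_actions, d))  # stand-in SF tables
    w_new_task = np.array([1.0, -0.5])                      # new task's reward weights
    print(gpi_action(psi, w_new_task, state=2))
```

Because the SF tables depend only on the dynamics and the features φ, they can be reused unchanged when the reward changes; adapting to a new task reduces to obtaining its weight vector w (for instance by regressing observed rewards on φ), after which GPI acts greedily over all cached policies at once.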
