Learning by Playing - Solving Sparse Reward Tasks from Scratch

We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors - from scratch - in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment - enabling it to excel at sparse reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach.
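To make the scheduling idea concrete, below is a minimal Python sketch of a SAC-X-style training loop: a scheduler picks which intention (task policy) to execute, the chosen policy acts for a fixed horizon, and every transition is stored with rewards for all tasks so that every task can be trained off-policy from the same data. This is not code from the paper; the `env`, `policies`, and `learner` interfaces, the task names, and the Boltzmann scheduler over average main-task returns are illustrative assumptions, loosely following the learned-scheduler variant described in the abstract.

```python
import math
import random
from collections import defaultdict

# Hypothetical task identifiers: one sparse "main" task plus auxiliary tasks.
TASKS = ["main", "touch", "lift", "move"]


class Scheduler:
    """Toy learned scheduler: prefers intentions whose past executions
    yielded high main-task return (a simplified stand-in for the learned
    scheduler in SAC-X)."""

    def __init__(self, tasks, temperature=1.0):
        self.tasks = tasks
        self.temperature = temperature
        self.returns = defaultdict(list)  # main-task returns seen after each choice

    def choose(self):
        # Boltzmann distribution over average main-task return per intention.
        prefs = [sum(self.returns[t]) / max(len(self.returns[t]), 1) for t in self.tasks]
        exps = [math.exp(p / self.temperature) for p in prefs]
        z = sum(exps)
        r, acc = random.random() * z, 0.0
        for task, e in zip(self.tasks, exps):
            acc += e
            if r <= acc:
                return task
        return self.tasks[-1]

    def update(self, task, main_return):
        self.returns[task].append(main_return)


def train(env, policies, learner, scheduler, episodes=100, steps_per_intention=50):
    """Schematic SAC-X-style loop. `env.step` is assumed to return a dict of
    rewards (one per task); `policies[task].act` and
    `learner.update_all_tasks` are hypothetical interfaces."""
    replay = []
    for _ in range(episodes):
        obs = env.reset()
        main_return = 0.0
        done = False
        while not done:
            task = scheduler.choose()
            for _ in range(steps_per_intention):
                action = policies[task].act(obs)
                next_obs, rewards, done = env.step(action)
                # Store rewards for *all* tasks so each task learns off-policy
                # from data gathered by any intention.
                replay.append((obs, action, rewards, next_obs, done))
                main_return += rewards["main"]
                obs = next_obs
                if done:
                    break
            scheduler.update(task, main_return)
        learner.update_all_tasks(replay)  # off-policy updates for every task head
```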
