IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent, IMPALA (Importance Weighted Actor-Learner Architecture), that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in the Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.
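To make the V-trace correction concrete, below is a minimal NumPy sketch of the n-step V-trace target v_s = V(x_s) + Σ_{t≥s} γ^{t−s} (Π_{i<t} c_i) δ_t V, where δ_t V = ρ_t (r_t + γ V(x_{t+1}) − V(x_t)) and ρ_t = min(ρ̄, π(a_t|x_t)/μ(a_t|x_t)), c_i = min(c̄, π(a_i|x_i)/μ(a_i|x_i)) are truncated importance weights between the learner policy π and the behaviour policy μ of the actors. The function name, argument layout, and use of NumPy here are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def vtrace_targets(values, next_value, rewards, discounts,
                   behaviour_log_probs, target_log_probs,
                   rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets for one unrolled trajectory of length T.

    values:              V(x_s) under the learner's value function, shape [T]
    next_value:          bootstrap value V(x_T), scalar
    rewards, discounts:  r_s and gamma_s for each step, shape [T]
    behaviour_log_probs: log mu(a_s | x_s) from the actor that generated the data, shape [T]
    target_log_probs:    log pi(a_s | x_s) under the learner's current policy, shape [T]
    rho_bar, c_bar:      truncation thresholds for the importance weights
    """
    rhos = np.exp(target_log_probs - behaviour_log_probs)
    clipped_rhos = np.minimum(rho_bar, rhos)   # rho_s: controls the fixed point
    clipped_cs = np.minimum(c_bar, rhos)       # c_s: controls contraction speed

    T = len(values)
    values_plus_1 = np.append(values, next_value)
    # Temporal differences weighted by the truncated importance ratio rho_s.
    deltas = clipped_rhos * (rewards + discounts * values_plus_1[1:] - values)

    # Backward recursion: v_s = V(x_s) + delta_s + gamma_s * c_s * (v_{s+1} - V(x_{s+1})).
    vs = np.zeros(T)
    acc = 0.0
    for s in reversed(range(T)):
        acc = deltas[s] + discounts[s] * clipped_cs[s] * acc
        vs[s] = values[s] + acc
    return vs

if __name__ == "__main__":
    # Toy usage with random (hypothetical) trajectory data.
    T = 5
    rng = np.random.default_rng(0)
    vs = vtrace_targets(
        values=rng.normal(size=T),
        next_value=0.0,
        rewards=rng.normal(size=T),
        discounts=np.full(T, 0.99),
        behaviour_log_probs=rng.normal(size=T),
        target_log_probs=rng.normal(size=T),
    )
    print(vs)  # one V-trace target per step of the unroll
```

When the data is on-policy (π = μ), every ρ_s and c_s equals 1 and the targets reduce to the usual n-step Bellman target, which is why V-trace smoothly interpolates between on-policy returns and truncated importance-sampling corrections as the actors' policies lag behind the learner.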

[1] John N. Tsitsiklis et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[2] Jürgen Schmidhuber et al. Long Short-Term Memory, 1997, Neural Computation.

[3] Doina Precup et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[4] Sanjoy Dasgupta et al. Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.

[5] H. Kushner et al. Stochastic Approximation and Recursive Algorithms and Applications, 2003.

[6] Terrence J. Sejnowski et al. TD(λ) Converges with Probability 1, 1994, Machine Learning.

[7] Pawel Wawrzynski et al. Real-time reinforcement learning by sequential Actor-Critics and experience replay, 2009, Neural Networks.

[8] Marc'Aurelio Ranzato et al. Large Scale Distributed Deep Networks, 2012, NIPS.

[9] John Tran et al. cuDNN: Efficient Primitives for Deep Learning, 2014, ArXiv.

[10] Matthieu Geist et al. Off-policy learning with eligibility traces: a survey, 2013, J. Mach. Learn. Res.

[11] Shane Legg et al. Massively Parallel Methods for Deep Reinforcement Learning, 2015, ArXiv.

[12] Shane Legg et al. Human-level control through deep reinforcement learning, 2015, Nature.

[13] Marc G. Bellemare et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.

[14] Yuval Tassa et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[15] Marc G. Bellemare et al. Q(λ) with Off-Policy Corrections, 2016, ALT.

[16] Samy Bengio et al. Revisiting Distributed Synchronous SGD, 2016, ArXiv.

[17] Alex Graves et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[18] Jian Sun et al. Identity Mappings in Deep Residual Networks, 2016, ECCV.

[19] Demis Hassabis et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[20] Phil Blunsom et al. Optimizing Performance of Recurrent Neural Networks on GPUs, 2016, ArXiv.

[21] Stephen Tyree et al. GA3C: GPU-based A3C for Deep Reinforcement Learning, 2016, ArXiv.

[22] Sergey Levine et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2015, ICLR.

[23] Marc G. Bellemare et al. Safe and Efficient Off-Policy Reinforcement Learning, 2016, NIPS.

[24] Demis Hassabis et al. Grounded Language Learning in a Simulated 3D World, 2017, ArXiv.

[25] Elman Mansimov et al. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, 2017, NIPS.

[26] Xi Chen et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning, 2017, ArXiv.

[27] Koray Kavukcuoglu et al. Combining policy gradient and Q-learning, 2016, ICLR.

[28] Martín Abadi et al. A computational model for TensorFlow: an introduction, 2017, MAPL@PLDI.

[29] Nando de Freitas et al. Sample Efficient Actor-Critic with Experience Replay, 2016, ICLR.

[30] Arjun Chandra et al. Efficient Parallel Methods for Deep Reinforcement Learning, 2017, ArXiv.

[31] Max Jaderberg et al. Population Based Training of Neural Networks, 2017, ArXiv.

[32] Demis Hassabis et al. Mastering the game of Go without human knowledge, 2017, Nature.

[33] Tom Schaul et al. Reinforcement Learning with Unsupervised Auxiliary Tasks, 2016, ICLR.

[34] David Budden et al. Distributed Prioritized Experience Replay, 2018, ICLR.

[35] Henryk Michalewski et al. Distributed Deep Reinforcement Learning: Learn how to play Atari games in 21 minutes, 2018, ISC.

[36] Shane Legg et al. Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents, 2018, ArXiv.

[37] Vijay Vasudevan et al. Learning Transferable Architectures for Scalable Image Recognition, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Marc G. Bellemare et al. The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning, 2017, ICLR.