Gossip-based Actor-Learner Architectures for Deep Reinforcement Learning

Multi-simulator training has contributed to the recent success of Deep Reinforcement Learning by stabilizing learning and allowing for higher training throughput. We propose Gossip-based Actor-Learner Architectures (GALA), in which several actor-learners (such as A2C agents) are organized in a peer-to-peer communication topology and exchange information through asynchronous gossip in order to take advantage of a large number of distributed simulators. We prove that GALA agents remain within an epsilon-ball of one another during training when using loosely coupled asynchronous communication. By reducing the amount of synchronization between agents, GALA is more computationally efficient and scalable than A2C, its fully synchronous counterpart. GALA also outperforms A2C, being more robust and sample efficient. We show that we can run several loosely coupled GALA agents in parallel on a single GPU and achieve significantly higher hardware utilization and frame rates than vanilla A2C at comparable power draws.
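The gossip idea in the abstract can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the paper's implementation: mixing is synchronous rather than asynchronous, the topology is a directed ring, the "learning" step is noisy gradient descent on a quadratic instead of an A2C update, and all function names (`ring_mixing_matrix`, `gossip_step`, `max_pairwise_distance`, `simulate`) are hypothetical. It shows only how repeated averaging with a peer pulls the agents' parameters toward one another while each agent keeps taking its own local steps.

```python
import numpy as np


def ring_mixing_matrix(n_agents: int, self_weight: float = 0.5) -> np.ndarray:
    """Row-stochastic mixing matrix for a directed ring (assumed topology).

    Each agent keeps `self_weight` of its own parameters and takes the
    remainder from its predecessor on the ring.
    """
    P = np.zeros((n_agents, n_agents))
    for i in range(n_agents):
        P[i, i] = self_weight
        P[i, (i - 1) % n_agents] = 1.0 - self_weight
    return P


def gossip_step(params: np.ndarray, P: np.ndarray) -> np.ndarray:
    """One synchronous gossip round: row i becomes a weighted average of rows."""
    return P @ params


def max_pairwise_distance(params: np.ndarray) -> float:
    """Largest Euclidean distance between any two agents' parameter vectors."""
    diffs = params[:, None, :] - params[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).max())


def simulate(n_agents=4, dim=3, steps=200, lr=0.1, noise=0.1, seed=0):
    """Alternate a noisy local gradient step with one gossip round.

    The objective 0.5 * ||x||^2 is a stand-in for a real RL loss.
    """
    rng = np.random.default_rng(seed)
    params = rng.normal(size=(n_agents, dim))
    P = ring_mixing_matrix(n_agents)
    distances = []
    for _ in range(steps):
        # Local update: gradient of the toy objective plus per-agent noise.
        grads = params + noise * rng.normal(size=params.shape)
        params = params - lr * grads
        # Communication: mix parameters with the ring neighbour.
        params = gossip_step(params, P)
        distances.append(max_pairwise_distance(params))
    return params, distances


if __name__ == "__main__":
    final_params, distances = simulate()
    print("max pairwise distance after training:", distances[-1])
```

With gossip alone (no local updates), the agents in this sketch contract to the average of their initial parameters; with noisy local updates interleaved, the pairwise distance is driven by the balance between the noise injected at each step and the mixing rate of the topology.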

Joelle Pineau | Joshua Romoff | Mahmoud Assran | Nicolas Ballas | Michael G. Rabbat
