Gossip-based Actor-Learner Architectures for Deep Reinforcement Learning

Multi-simulator training has contributed to the recent success of Deep Reinforcement Learning by stabilizing learning and allowing for higher training throughput. We propose Gossip-based Actor-Learner Architectures (GALA), in which several actor-learners (such as A2C agents) are organized in a peer-to-peer communication topology and exchange information through asynchronous gossip in order to take advantage of a large number of distributed simulators. We prove that GALA agents remain within an epsilon-ball of one another during training when using loosely coupled asynchronous communication. By reducing the amount of synchronization between agents, GALA is more computationally efficient and scalable than A2C, its fully synchronous counterpart. GALA also outperforms A2C, being more robust and sample efficient. We show that we can run several loosely coupled GALA agents in parallel on a single GPU and achieve significantly higher hardware utilization and frame rates than vanilla A2C at comparable power draws.
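To make the gossip idea concrete, the following is a minimal sketch and not the GALA implementation. It assumes a directed ring topology, a uniform mixing weight of 0.5, and a toy noisy quadratic objective standing in for an actor-learner gradient step. For simplicity the gossip round here is synchronous, whereas the abstract describes asynchronous gossip. All function names and parameters (`local_update`, `gossip_round`, `max_pairwise_distance`, `run`, `mix`) are hypothetical.

```python
import random


def local_update(params, lr, rng):
    # Hypothetical stand-in for an actor-learner gradient step:
    # a noisy gradient step on the toy objective 0.5 * ||params||^2.
    return [p - lr * (p + rng.gauss(0.0, 1.0)) for p in params]


def gossip_round(agents, mix=0.5):
    # Assumed directed ring: agent i receives from agent (i - 1) % n and
    # replaces its parameters with a weighted average of its own and the peer's.
    n = len(agents)
    return [
        [(1.0 - mix) * own + mix * peer
         for own, peer in zip(agents[i], agents[(i - 1) % n])]
        for i in range(n)
    ]


def max_pairwise_distance(agents):
    # Largest Euclidean distance between any two agents' parameter vectors.
    best = 0.0
    for i in range(len(agents)):
        for j in range(i + 1, len(agents)):
            d = sum((a - b) ** 2 for a, b in zip(agents[i], agents[j])) ** 0.5
            best = max(best, d)
    return best


def run(num_agents=4, dim=3, steps=200, lr=0.05, gossip=True, seed=0):
    # Alternate local updates with (optional) gossip and report how far
    # apart the agents are at the end of training.
    rng = random.Random(seed)
    agents = [[rng.uniform(-5.0, 5.0) for _ in range(dim)]
              for _ in range(num_agents)]
    for _ in range(steps):
        agents = [local_update(p, lr, rng) for p in agents]
        if gossip:
            agents = gossip_round(agents)
    return max_pairwise_distance(agents)


if __name__ == "__main__":
    print("max distance with gossip:   ", run(gossip=True))
    print("max distance without gossip:", run(gossip=False))
```

In this toy setting each gossip round keeps the average of the agents' parameters unchanged while pulling the individual agents toward one another, which is the intuition behind agents remaining within a bounded distance of each other. The bound proved in the paper concerns the asynchronous case, which this sketch does not model.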
