Diff-DAC: Distributed Actor-Critic for Multitask Deep Reinforcement Learning

We propose Diff-DAC, a multiagent distributed actor-critic algorithm for multitask reinforcement learning (MRL). The agents are connected, forming a (possibly sparse) network. Each agent is assigned a task and has access to data from that local task only. During learning, the agents communicate some parameters to their neighbors. Since each agent incorporates its neighbors' parameters into its own learning rule, information diffuses across the network, and the agents can learn a common policy that generalizes well across all tasks. Diff-DAC is scalable: the computational complexity and communication overhead per agent grow with the number of neighbors, rather than with the total number of agents. Moreover, the algorithm is fully distributed in the sense that the agents self-organize, with no need for a coordinator node. Diff-DAC follows an actor-critic scheme in which the value function and the policy are approximated with deep neural networks, so it can learn expressive policies from raw data. As a by-product of Diff-DAC's derivation from duality theory, we provide novel insights into the standard actor-critic framework, showing that it is actually an instance of the dual-ascent method for approximating the solution of a linear program. Experiments illustrate the performance of the algorithm in the cart-pole, inverted-pendulum, and swing-up cart-pole environments.
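To make the diffusion mechanism concrete, the sketch below shows a generic adapt-then-combine diffusion update over a ring network, which is the general pattern the abstract describes: each agent first takes a gradient step on its own task, then averages its parameters with its neighbors'. This is a loose illustration rather than the paper's algorithm; in particular, each agent's actor-critic gradient is replaced by the gradient of a toy quadratic loss, and all names (`local_grad`, `centers`, the combination matrix `A`) are illustrative assumptions.

```python
import numpy as np

N, DIM, STEPS, LR = 8, 4, 200, 0.1
rng = np.random.default_rng(0)

# Stand-in for the per-agent tasks: each agent's loss is a quadratic
# centered at a different point, so the common "policy" should settle
# near the average of the centers.
centers = rng.normal(size=(N, DIM))

def local_grad(theta_k, center_k):
    """Stand-in for agent k's local actor-critic gradient,
    estimated from that agent's own task data only."""
    return theta_k - center_k

# Doubly stochastic combination weights for a ring network: each agent
# mixes its own parameters with those of its two neighbors. Per-agent
# cost scales with the number of neighbors, not with N.
A = np.zeros((N, N))
for k in range(N):
    A[k, k] = 0.5
    A[k, (k - 1) % N] = 0.25
    A[k, (k + 1) % N] = 0.25

theta = rng.normal(size=(N, DIM))
for _ in range(STEPS):
    # Adapt: each agent descends its own local loss.
    psi = theta - LR * np.array(
        [local_grad(theta[k], centers[k]) for k in range(N)]
    )
    # Combine: each agent averages its neighbors' intermediate iterates.
    theta = A @ psi

print("disagreement across agents:", np.linalg.norm(theta - theta.mean(0)))
print("distance to average center:", np.linalg.norm(theta.mean(0) - centers.mean(0)))
```

Running this shows the agents' parameters reaching consensus near the minimizer of the sum of the local losses, the same qualitative behavior the abstract claims for the shared policy across tasks.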
