Policy Gradient With Value Function Approximation For Collective Multiagent Planning

Decentralized (PO)MDPs provide an expressive framework for sequential decision making in multiagent systems. Given their computational complexity, recent research has focused on tractable yet practical subclasses of Dec-POMDPs. We address one such subclass, CDEC-POMDPs, in which the collective behavior of a population of agents affects the joint reward and environment dynamics. Our main contribution is an actor-critic (AC) reinforcement learning method for optimizing CDEC-POMDP policies. Vanilla AC converges slowly on larger problems. To address this, we show how a particular decomposition of the approximate action-value function over agents leads to effective updates, and we also derive a new way to train the critic based on local reward signals. Comparisons on a synthetic benchmark and a real-world taxi fleet optimization problem show that our new AC approach provides better-quality solutions than the previous best approaches.
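To make the two key ideas concrete, the sketch below illustrates, under simplifying assumptions, an actor-critic loop in which the joint action-value is approximated as a sum of per-agent critics and each critic is trained from a local reward signal. This is a minimal toy illustration, not the paper's actual CDEC-POMDP algorithm (which exploits agent counts and anonymity); all names, sizes, and the random dynamics here are hypothetical.

```python
import numpy as np

# Minimal sketch, assuming: a shared softmax policy and a shared per-agent
# critic table, with the global value decomposed as
#   Q(s, a) ~= sum_m q_w(s_m, a_m),
# and each q_w trained by TD(0) against the agent's *local* reward. The
# environment below is a placeholder (random transitions and rewards).

rng = np.random.default_rng(0)
n_agents, n_states, n_actions = 50, 10, 4
gamma, alpha_w, alpha_theta = 0.95, 0.05, 0.01

theta = np.zeros((n_states, n_actions))  # shared policy logits (actor)
w = np.zeros((n_states, n_actions))      # shared per-agent critic

def policy(state):
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

states = rng.integers(n_states, size=n_agents)
for step in range(1000):
    actions = np.array([rng.choice(n_actions, p=policy(s)) for s in states])
    next_states = rng.integers(n_states, size=n_agents)  # toy dynamics
    local_rewards = rng.normal(size=n_agents)            # toy local rewards

    for m in range(n_agents):
        s, a, s2 = states[m], actions[m], next_states[m]
        # Critic: TD(0) update driven by this agent's local reward signal.
        v_next = policy(s2) @ w[s2]
        td = local_rewards[m] + gamma * v_next - w[s, a]
        w[s, a] += alpha_w * td
        # Actor: policy-gradient step using the per-agent critic, with the
        # state value as a baseline (advantage form).
        adv = w[s, a] - policy(s) @ w[s]
        grad = -policy(s)
        grad[a] += 1.0  # d/d(logits) of log softmax at the taken action
        theta[s] += alpha_theta * adv * grad
    states = next_states
```

Because the critic decomposes over agents, each update touches only one agent's local (state, action) entry, which is what makes the per-agent updates cheap relative to learning a monolithic joint action-value function.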
