Multiagent Reinforcement Learning: Rollout and Policy Iteration

We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., starting from some base policy and generating an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program. In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components, each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents share an objective function and have shared, perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach whereby, at every stage, the agents sequentially (one at a time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the total computation grows exponentially with the number of agents. Despite this dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property without any on-line coordination of control selection among the agents. For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and their approximate forms, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
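
As a rough illustration of the one-agent-at-a-time rollout step described above, here is a minimal Python sketch. All names (multiagent_rollout_step, base_policy, q_factor, agent_action_sets) are hypothetical placeholders rather than identifiers from the paper, and the Q-factor evaluator is assumed to be supplied externally, e.g., by Monte Carlo simulation that applies a joint control at the current state and then follows the base policy.

```python
from typing import Callable, Sequence, Tuple

def multiagent_rollout_step(
    state,
    agent_action_sets: Sequence[Sequence],       # U_1, ..., U_m: each agent's control component set
    base_policy: Callable[[object], Tuple],       # maps a state to a joint control (one component per agent)
    q_factor: Callable[[object, Tuple], float],   # estimated cost of applying a joint control now and
                                                  # following the base policy thereafter (assumed given)
) -> Tuple:
    """One stage of one-agent-at-a-time (multiagent) rollout.

    When agent i chooses its component, the components already selected by
    agents 0..i-1 are kept fixed and agents i+1..m-1 tentatively use the
    base policy's components.  The number of q_factor evaluations is
    sum_i |U_i| (linear in the number of agents), versus prod_i |U_i|
    for standard rollout over the joint control space.
    """
    m = len(agent_action_sets)
    chosen = list(base_policy(state))      # start from the base policy's joint control
    for i in range(m):                     # agents decide one at a time
        best_u, best_q = chosen[i], float("inf")
        for u in agent_action_sets[i]:     # enumerate only agent i's own components
            trial = tuple(chosen[:i]) + (u,) + tuple(chosen[i + 1:])
            q = q_factor(state, trial)
            if q < best_q:
                best_u, best_q = u, q
        chosen[i] = best_u                 # fix agent i's rollout control component
    return tuple(chosen)
```

Under these assumptions, the multiagent rollout policy is the one obtained by applying such a step at every encountered state, with the base policy used for all simulated future decisions inside the Q-factor evaluation.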
