Scalable Planning and Learning for Multiagent POMDPs

Online, sample-based planning algorithms for POMDPs have shown great promise in scaling to problems with large state spaces, but they become intractable for large action and observation spaces. This is particularly problematic in multiagent POMDPs, where the action and observation spaces grow exponentially with the number of agents. To combat this intractability, we propose a novel scalable approach based on sample-based planning and factored value functions that exploits structure present in many multiagent settings. This approach applies not only to planning but also to the Bayesian reinforcement learning setting. Experimental results show that we are able to provide high-quality solutions to large multiagent planning and learning problems.
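The core idea behind factored value functions is that the joint Q-value decomposes into a sum of local components, each depending only on a small subset of agents, so the joint maximization can exploit the interaction structure instead of enumerating all joint actions. A minimal illustrative sketch (the payoff tables and the chain structure are invented for illustration, not taken from the paper's experiments):

```python
from itertools import product

# Factored joint value over 3 agents with 2 actions each, arranged in a
# chain coordination graph: Q(a1, a2, a3) = Q12(a1, a2) + Q23(a2, a3).
# The numeric payoffs below are purely illustrative.
Q12 = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 1.5}
Q23 = {(0, 0): 0.5, (0, 1): 3.0, (1, 0): 1.0, (1, 1): 0.0}

def brute_force():
    """Baseline: enumerate all joint actions (exponential in #agents)."""
    return max(product([0, 1], repeat=3),
               key=lambda a: Q12[(a[0], a[1])] + Q23[(a[1], a[2])])

def variable_elimination():
    """Exploit the chain structure: eliminate agent 3, then solve 1 and 2."""
    # Agent 3's best response (value, action) for each action of agent 2.
    f3 = {a2: max((Q23[(a2, a3)], a3) for a3 in (0, 1)) for a2 in (0, 1)}
    # Jointly optimize agents 1 and 2 against agent 3's best response.
    _, a1, a2 = max((Q12[(a1, a2)] + f3[a2][0], a1, a2)
                    for a1 in (0, 1) for a2 in (0, 1))
    return (a1, a2, f3[a2][1])
```

Both routines recover the same joint action, but variable elimination touches only the local tables, which is what makes factored approaches scale with the number of agents when the interaction graph is sparse.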
