Optimal and Approximate Q-value Functions for Decentralized POMDPs

Decision-theoretic planning is a popular approach to sequential decision-making problems because it treats uncertainty in sensing and acting in a principled way. In single-agent frameworks like MDPs and POMDPs, planning can be carried out using Q-value functions: an optimal Q-value function Q* is computed recursively by dynamic programming, and an optimal policy is then extracted from Q*. In this paper we study whether similar Q-value functions can be defined for decentralized POMDP models (Dec-POMDPs), and how policies can be extracted from such value functions. We define two forms of the optimal Q-value function for Dec-POMDPs: one that gives a normative description as the Q-value function of an optimal pure joint policy, and another that is sequentially rational and thus gives a recipe for computation. This computation, however, is infeasible for all but the smallest problems. Therefore, we analyze various approximate Q-value functions that allow for efficient computation. We describe how they relate, and we prove that they all provide an upper bound on the optimal Q-value function Q*. Finally, unifying some previous approaches for solving Dec-POMDPs, we describe a family of algorithms for extracting policies from such Q-value functions, and perform an experimental evaluation on existing test problems, including a new firefighting benchmark problem.
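As a minimal single-agent illustration of the pipeline the abstract describes — compute Q* by dynamic programming, then extract a greedy policy — the sketch below performs finite-horizon Q-value iteration for a fully observable MDP. The transition model, rewards, and horizon are hypothetical placeholders, and this is only the easy single-agent case: in a Dec-POMDP, the optimal Q-values the paper defines range over joint action-observation histories rather than states, which is precisely what makes their computation infeasible for all but the smallest problems.

```python
import numpy as np

def q_value_iteration(T, R, horizon):
    """Finite-horizon dynamic programming for an MDP.

    T: transition model, shape (A, S, S), with T[a, s, s'] = Pr(s' | s, a).
    R: immediate rewards, shape (S, A).
    Returns a list of Q-value arrays Q[t] of shape (S, A), t = 0..horizon-1.
    """
    n_states, n_actions = R.shape
    Q = [np.zeros((n_states, n_actions)) for _ in range(horizon)]
    V_next = np.zeros(n_states)  # value beyond the last stage is zero
    for t in reversed(range(horizon)):
        for a in range(n_actions):
            # Bellman backup: immediate reward plus expected next-stage value.
            Q[t][:, a] = R[:, a] + T[a] @ V_next
        V_next = Q[t].max(axis=1)
    return Q

def greedy_policy(Q):
    """Extract the optimal non-stationary policy: pi[t][s] = argmax_a Q[t][s, a]."""
    return [q.argmax(axis=1) for q in Q]

# Tiny hypothetical example: 2 states, 2 actions, horizon 3.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.5, 0.5]]])  # action 1
R = np.array([[1.0, 0.0],                 # rewards in state 0
              [0.0, 2.0]])                # rewards in state 1
Q = q_value_iteration(T, R, horizon=3)
print(greedy_policy(Q))
```

Note that approximate value functions of the kind the paper analyzes, such as the well-known Q_MDP heuristic, reuse exactly this underlying-MDP computation to obtain an efficiently computable upper bound on the true optimal values.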
