论文信息 - Optimal and Approximate Q-value Functions for Decentralized POMDPs

Optimal and Approximate Q-value Functions for Decentralized POMDPs

Decision-theoretic planning is a popular approach to sequential decision making problems, because it treats uncertainty in sensing and acting in a principled way. In single-agent frameworks like MDPs and POMDPs, planning can be carried out by resorting to Q-value functions: an optimal Q-value function Q* is computed in a recursive manner by dynamic programming, and then an optimal policy is extracted from Q*. In this paper we study whether similar Q-value functions can be defined for decentralized POMDP models (Dec-POMDPs), and how policies can be extracted from such value functions. We define two forms of the optimal Q-value function for Dec-POMDPs: one that gives a normative description as the Q-value function of an optimal pure joint policy and another one that is sequentially rational and thus gives a recipe for computation. This computation, however, is infeasible for all but the smallest problems. Therefore, we analyze various approximate Q-value functions that allow for efficient computation. We describe how they relate, and we prove that they all provide an upper bound to the optimal Q-value function Q*. Finally, unifying some previous approaches for solving Dec-POMDPs, we describe a family of algorithms for extracting policies from such Q-value functions, and perform an experimental evaluation on existing test problems, including a new firefighting benchmark problem.

[1] J. Nash. Equilibrium Points in N-Person Games. , 1950, Proceedings of the National Academy of Sciences of the United States of America.

[2] H. W. Kuhn,et al. 11. Extensive Games and the Problem of Information , 1953 .

[3] E. J. Sondik,et al. The Optimal Control of Partially Observable Markov Decision Processes. , 1971 .

[4] Dimitri P. Bertsekas,et al. Dynamic Programming and Optimal Control, Vol. II , 1976 .

[5] Frits C. Schoute. Symmetric team problems and multi access wire communication , 1978, Autom..

[6] S. Marcus,et al. Decentralized control of finite state Markov processes , 1980, 1980 19th IEEE Conference on Decision and Control including the Symposium on Adaptive Processes.

[7] John N. Tsitsiklis,et al. The Complexity of Markov Decision Processes , 1987, Math. Oper. Res..

[8] M. Aicardi,et al. Decentralized optimal control of Markov chains with a common past information set , 1987 .

[9] Ken Binmore,et al. Fun and games : a text on game theory , 1991 .

[10] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[11] Bernhard von Stengel,et al. Fast algorithms for finding randomized strategies in game trees , 1994, STOC '94.

[12] Ariel Rubinstein,et al. A Course in Game Theory , 1995 .

[13] Peter Norvig,et al. Artificial Intelligence: A Modern Approach , 1995 .

[14] Leslie Pack Kaelbling,et al. Learning Policies for Partially Observable Environments: Scaling Up , 1997, ICML.

[15] Dimitri P. Bertsekas,et al. Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[16] G. W. Wornell,et al. Decentralized control of a multiple access broadcast channel: performance bounds , 1996, Proceedings of 35th IEEE Conference on Decision and Control.

[17] Craig Boutilier,et al. Planning, Learning and Coordination in Multiagent Decision Processes , 1996, TARK.

[18] Hiroaki Kitano,et al. RoboCup: The Robot World Cup Initiative , 1997, AGENTS '97.

[19] Avi Pfeffer,et al. Representations and Solutions for Game-Theoretic Problems , 1997, Artif. Intell..

[20] Leslie Pack Kaelbling,et al. Planning and Acting in Partially Observable Stochastic Domains , 1998, Artif. Intell..

[21] Richard S. Sutton,et al. Introduction to Reinforcement Learning , 1998 .

[22] Hiroaki Kitano,et al. RoboCup Rescue: search and rescue in large-scale disasters as a domain for autonomous agents research , 1999, IEEE SMC'99 Conference Proceedings. 1999 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No.99CH37028).

[23] Craig Boutilier,et al. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage , 1999, J. Artif. Intell. Res..

[24] Neil Immerman,et al. The Complexity of Decentralized Control of Markov Decision Processes , 2000, UAI.

[25] Kee-Eung Kim,et al. Learning to Cooperate via Policy Search , 2000, UAI.

[26] Milos Hauskrecht,et al. Value-Function Approximations for Partially Observable Markov Decision Processes , 2000, J. Artif. Intell. Res..

[27] Eitan Altman,et al. Applications of Markov Decision Processes in Communication Networks , 2000 .

[28] Lex Weaver,et al. A Multi-Agent Policy-Gradient Approach to Network Routing , 2001, ICML.

[29] Victor R. Lesser,et al. Communication decisions in multi-agent cooperation: model and experiments , 2001, AGENTS '01.

[30] Milind Tambe,et al. Team Formation for Reformation in Multiagent Domains Like RoboCupRescue , 2002, RoboCup.

[31] Leslie Pack Kaelbling,et al. Reinforcement Learning by Policy Search , 2002 .

[32] François Charpillet,et al. A heuristic approach for solving decentralized-POMDP: assessment on the pursuit problem , 2002, SAC '02.

[33] Milind Tambe,et al. The Communicative Multiagent Team Decision Problem: Analyzing Teamwork Theories and Models , 2011, J. Artif. Intell. Res..

[34] Milind Tambe,et al. Team Formation for Reformation , 2002 .

[35] Lynne E. Parker,et al. Guest editorial advances in multirobot systems , 2002, IEEE Trans. Robotics Autom..

[36] Milind Tambe,et al. Role allocation and reallocation in multiagent teams: towards a practical analysis , 2003, AAMAS '03.

[37] Claudia V. Goldman,et al. Optimizing information exchange in cooperative multi-agent systems , 2003, AAMAS '03.

[38] Shobha Venkataraman,et al. Efficient Solution Algorithms for Factored MDPs , 2003, J. Artif. Intell. Res..

[39] Milind Tambe,et al. Distributed Sensor Networks: A Multiagent Perspective , 2003 .

[40] Joelle Pineau,et al. Point-based value iteration: An anytime algorithm for POMDPs , 2003, IJCAI.

[41] David V. Pynadath,et al. Taming Decentralized POMDPs: Towards Efficient Policy Computation for Multiagent Settings , 2003, IJCAI.

[42] Peter Norvig,et al. Artificial intelligence - a modern approach, 2nd Edition , 2003, Prentice Hall series in artificial intelligence.

[43] Makoto Yokoo,et al. Communications for improving policy computation in distributed POMDPs , 2004, Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004..

[44] P. J. Gmytrasiewicz,et al. A Framework for Sequential Planning in Multi-Agent Settings , 2005, AI&M.

[45] Jeff G. Schneider,et al. Approximate solutions for partially observable stochastic games with common payoffs , 2004, Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004..

[46] Shlomo Zilberstein,et al. Dynamic Programming for Partially Observable Stochastic Games , 2004, AAAI.

[47] Claudia V. Goldman,et al. Solving Transition Independent Decentralized Markov Decision Processes , 2004, J. Artif. Intell. Res..

[48] Claudia V. Goldman,et al. Decentralized Control of Cooperative Systems: Categorization and Complexity Analysis , 2004, J. Artif. Intell. Res..

[49] Victor R. Lesser,et al. Decentralized Markov decision processes with event-driven interactions , 2004, Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004..

[50] Nikos A. Vlassis,et al. Perseus: Randomized Point-based Value Iteration for POMDPs , 2005, J. Artif. Intell. Res..

[51] S. Zilberstein,et al. Complexity analysis and optimal algorithms for decentralized decision making , 2005 .

[52] Abdel-Illah Mouaddib,et al. A polynomial algorithm for decentralized Markov decision processes with temporal constraints , 2005, AAMAS '05.

[53] Makoto Yokoo,et al. Networked Distributed POMDPs: A Synergy of Distributed Constraint Optimization and POMDPs , 2005, IJCAI.

[54] Jeff G. Schneider,et al. Game Theoretic Control for Robot Teams , 2005, Proceedings of the 2005 IEEE International Conference on Robotics and Automation.

[55] Dimitri P. Bertsekas,et al. Dynamic programming and optimal control, 3rd Edition , 2005 .

[56] Manuela M. Veloso,et al. Reasoning about joint beliefs for execution-time communication decisions , 2005, AAMAS '05.

[57] Victor R. Lesser,et al. Analyzing Myopic Approaches for Multi-Agent Communication , 2005, IAT.

[58] François Charpillet,et al. MAA*: A Heuristic Search Algorithm for Solving Decentralized POMDPs , 2005, UAI.

[59] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[60] François Charpillet,et al. An Optimal Best-First Search Algorithm for Solving Infinite Horizon DEC-POMDPs , 2005, ECML.

[61] Victor R. Lesser,et al. Analyzing myopic approaches for multi-agent communication , 2005, IEEE/WIC/ACM International Conference on Intelligent Agent Technology.

[62] Shlomo Zilberstein,et al. Bounded Policy Iteration for Decentralized POMDPs , 2005, IJCAI.

[63] Brahim Chaib-draa,et al. An online POMDP algorithm for complex multiagent environments , 2005, AAMAS '05.

[64] Shie Mannor,et al. A Tutorial on the Cross-Entropy Method , 2005, Ann. Oper. Res..

[65] Makoto Yokoo,et al. Exploiting Locality of Interaction in Networked Distributed POMDPs , 2006, AAAI Spring Symposium: Distributed Plan and Schedule Management.

[66] Nikos A. Vlassis,et al. Decentralized planning under uncertainty for teams of communicating agents , 2006, AAMAS '06.