Optimally Solving Dec-POMDPs as Continuous-State MDPs: Theory and Algorithms

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general model for decision-making under uncertainty in cooperative decentralized settings, but are difficult to solve optimally (NEXP-complete). As a new way of solving these problems, we introduce the idea of transforming a Dec-POMDP into a continuous-state deterministic MDP with a piecewise-linear and convex value function. This approach makes use of the fact that planning can be accomplished in a centralized offline manner, while execution can still be distributed. This new Dec-POMDP formulation, which we call an occupancy MDP, allows powerful POMDP and continuous-state MDP methods to be used for the first time. When the curse of dimensionality becomes prohibitive, we refine this basic approach and present ways to combine heuristic search and compact representations that exploit the structure present in multi-agent domains, without losing the ability to eventually converge to an optimal solution. In particular, we introduce feature-based heuristic search that relies on feature-based compact representations, point-based updates, and efficient action selection. A theoretical analysis demonstrates that our feature-based heuristic search algorithms terminate in finite time with an optimal solution. We include an extensive empirical analysis using well-known benchmarks, demonstrating that our approach provides significant scalability improvements compared to the state of the art.
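To make the occupancy-MDP idea concrete, the following is a minimal illustrative sketch of its two core ingredients: propagating an occupancy state (a distribution over hidden states and joint histories) through one step under a joint decision rule, and evaluating a piecewise-linear and convex (PWLC) value function as a maximum over linear functions of the occupancy state. All names, data structures, and numbers here (`next_occupancy`, `pwlc_value`, the two-state example) are assumptions made for this sketch, not the paper's implementation.

```python
from collections import defaultdict

def next_occupancy(eta, decision_rule, T, O):
    """One deterministic occupancy-MDP step.

    eta              : {(state, joint_history): probability}
    decision_rule    : {joint_history: joint_action}
    T[s][a][s2]      : transition probability P(s2 | s, a)
    O[a][s2][z]      : joint-observation probability P(z | a, s2)
    """
    eta2 = defaultdict(float)
    for (s, h), p in eta.items():
        a = decision_rule[h]                 # action prescribed for this history
        for s2, pt in T[s][a].items():
            for z, po in O[a][s2].items():
                # histories grow by the (action, observation) pair taken
                eta2[(s2, h + ((a, z),))] += p * pt * po
    return dict(eta2)

def pwlc_value(eta, alpha_vectors):
    """PWLC value: the max over linear functions (alpha vectors) of eta."""
    return max(sum(alpha.get(x, 0.0) * p for x, p in eta.items())
               for alpha in alpha_vectors)

# Tiny two-state, one-action, one-observation example (numbers invented).
T = {0: {'a': {0: 0.5, 1: 0.5}}, 1: {'a': {1: 1.0}}}
O = {'a': {0: {'z': 1.0}, 1: {'z': 1.0}}}
eta0 = {(0, ()): 1.0}          # initial occupancy: state 0, empty history
rule = {(): 'a'}               # decision rule mapping each history to an action
eta1 = next_occupancy(eta0, rule, T, O)
```

Note that the step is deterministic in the occupancy state even though the underlying Dec-POMDP is stochastic: fixing the joint decision rule fixes the next occupancy state exactly, which is what lets heuristic-search and continuous-state MDP machinery apply.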
