Increasing scalability in algorithms for centralized and decentralized partially observable Markov decision processes: efficient decision-making and coordination in uncertain environments

As agents are built for ever more complex environments, methods that explicitly account for uncertainty in the system have strong advantages. Such uncertainty is common in domains such as robot navigation, medical diagnosis and treatment, inventory management, sensor networks and e-commerce. When a single decision maker is present, the partially observable Markov decision process (POMDP) is a popular and powerful model. When choices are made in a decentralized manner by a set of decision makers, the problem can be modeled as a decentralized partially observable Markov decision process (DEC-POMDP). While POMDPs and DEC-POMDPs offer rich frameworks for sequential decision making under uncertainty, the computational complexity of each model presents an important research challenge. To address this high complexity, this thesis develops several solution methods based on exploiting domain structure, memory-bounded representations and sampling. These approaches address some of the major bottlenecks for decision-making in real-world uncertain systems. The methods include a more efficient optimal algorithm for DEC-POMDPs as well as scalable approximate algorithms for both POMDPs and DEC-POMDPs. Key contributions include optimizing compact representations as well as automatic structure extraction and exploitation. Together, these approaches increase the scalability of the algorithms while also improving their solution quality.
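For background, a minimal sketch of the standard POMDP formulation in conventional notation (generic background, not the thesis's specific algorithms): a POMDP is a tuple (S, A, O, T, Ω, R, γ) of states, actions, observations, a transition model T(s' | s, a), an observation model Ω(o | s', a), a reward function R(s, a) and a discount factor γ. Because the true state is hidden, the agent maintains a belief b, a probability distribution over states, updated by Bayes' rule after taking action a and receiving observation o:

% Standard POMDP belief update (conventional notation, assumed background):
\[
  b'(s') \;=\; \frac{\Omega(o \mid s', a)\,\sum_{s \in S} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)},
  \qquad
  \Pr(o \mid b, a) \;=\; \sum_{s' \in S} \Omega(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s).
\]

A DEC-POMDP extends this to a team of agents that act on their individual observation histories while sharing a single team reward; no individual agent can compute the belief above from its local information alone, which is one reason solving DEC-POMDPs is substantially harder than solving POMDPs (NEXP-complete in the finite-horizon case).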
