Monte-Carlo Expectation Maximization for Decentralized POMDPs

We address two significant drawbacks of state-of-the-art solvers of decentralized POMDPs (DECPOMDPs): the reliance on complete knowledge of the model and limited scalability as the complexity of the domain grows. We extend a recently proposed approach for solving DEC-POMDPs via a reduction to the maximum likelihood problem, which in turn can be solved using EM. We introduce a model-free version of this approach that employs Monte-Carlo EM (MCEM). While a naive implementation of MCEM is inadequate in multiagent settings, we introduce several improvements in sampling that produce high-quality results on a variety of DEC-POMDP benchmarks, including large problems with thousands of agents.

[1]  George A. Bekey,et al.  On autonomous robots , 1998, The Knowledge Engineering Review.

[2]  Ronald C. Neath,et al.  On Convergence Properties of the Monte Carlo EM Algorithm , 2012, 1206.4768.

[3]  Marc Toussaint,et al.  Scalable Multiagent Planning Using Probabilistic Inference , 2011, IJCAI.

[4]  Marc Toussaint,et al.  Learning model-free robot control by a Monte Carlo EM algorithm , 2009, Auton. Robots.

[5]  Xinhua Zhang,et al.  Conditional random fields for multi-agent reinforcement learning , 2007, ICML '07.

[6]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[7]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[8]  Jaakko Peltonen,et al.  Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning , 2011, NIPS.

[9]  Bikramjit Banerjee,et al.  Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs , 2012, AAAI.

[10]  Robert M Thrall,et al.  Mathematics of Operations Research. , 1978 .

[11]  Marc Toussaint,et al.  Hierarchical POMDP Controller Optimization by Likelihood Maximization , 2008, UAI.

[12]  K. Schittkowski,et al.  NONLINEAR PROGRAMMING , 2022 .

[13]  Shlomo Zilberstein,et al.  Memory-Bounded Dynamic Programming for DEC-POMDPs , 2007, IJCAI.

[14]  Makoto Yokoo,et al.  Taming Decentralized POMDPs: Towards Efficient Policy Computation for Multiagent Settings , 2003, IJCAI.

[15]  Shlomo Zilberstein,et al.  Policy Iteration for Decentralized Control of Markov Decision Processes , 2009, J. Artif. Intell. Res..

[16]  Feng Wu,et al.  Rollout Sampling Policy Iteration for Decentralized POMDPs , 2010, UAI.

[17]  Shlomo Zilberstein,et al.  Dynamic Programming for Partially Observable Stochastic Games , 2004, AAAI.

[18]  François Charpillet,et al.  Mixed Integer Linear Programming for Exact Finite-Horizon Planning in Decentralized Pomdps , 2007, ICAPS.

[19]  François Charpillet,et al.  MAA*: A Heuristic Search Algorithm for Solving Decentralized POMDPs , 2005, UAI.

[20]  Makoto Yokoo,et al.  Networked Distributed POMDPs: A Synergy of Distributed Constraint Optimization and POMDPs , 2005, IJCAI.

[21]  Richard S. Sutton,et al.  Introduction to Reinforcement Learning , 1998 .

[22]  Marc Toussaint,et al.  Probabilistic inference for solving discrete and continuous state Markov Decision Processes , 2006, ICML.

[23]  G. C. Wei,et al.  A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms , 1990 .

[24]  Neil Immerman,et al.  The Complexity of Decentralized Control of Markov Decision Processes , 2000, UAI.

[25]  Peter Kulchyski and , 2015 .

[26]  Marc Toussaint,et al.  Model-free reinforcement learning as mixture learning , 2009, ICML '09.

[27]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[28]  Frans A. Oliehoek,et al.  Scaling Up Optimal Heuristic Search in Dec-POMDPs via Incremental Expansion , 2011, IJCAI.

[29]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[30]  Jaakko Peltonen,et al.  Efficient Planning for Factored Infinite-Horizon DEC-POMDPs , 2011, IJCAI.

[31]  Brahim Chaib-draa,et al.  Toward error-bounded algorithms for infinite-horizon DEC-POMDPs , 2011, AAMAS.

[32]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[33]  Shlomo Zilberstein,et al.  Anytime Planning for Decentralized POMDPs using Expectation Maximization , 2010, UAI.

[34]  Victor R. Lesser,et al.  Coordinated Multi-Agent Reinforcement Learning in Networked Distributed POMDPs , 2011, AAAI.