Solving Finite Horizon Decentralized POMDPs by Distributed Reinforcement Learning

Decentralized partially observable Markov decision processes (Dec-POMDPs) offer a powerful modeling framework for realistic multi-agent coordination problems under uncertainty. Prevalent solution techniques are centralized and assume prior knowledge of the model. We propose a distributed reinforcement learning approach in which agents take turns learning best responses to each other's policies. This decentralizes the policy computation problem and relaxes the reliance on full knowledge of the problem parameters. We derive the relation between the sample complexity of best-response learning and the error tolerance. Our key contribution is to show that even the “per-leaf” sample complexity could grow exponentially with the problem horizon. We show empirically that even when the sample requirement is set lower than what theory demands, our learning approach can produce (near-)optimal policies on some benchmark Dec-POMDP problems. We also propose a slight modification that empirically appears to significantly reduce the learning time with relatively little impact on the quality of the learned policies.
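
To make the alternating best-response idea concrete, the sketch below shows one way such a learner could be organized: each agent in turn runs tabular Q-learning over its own action-observation histories while the other agent's policy is held fixed and followed greedily. The toy environment (ToyDecPOMDP), the horizon, the episode budget, and the learning parameters are assumptions made for illustration only; they are not the paper's benchmarks, algorithm, or derived sample bounds.

```python
import random
from collections import defaultdict

# Illustrative settings only; not the sample bounds derived in the paper.
HORIZON = 2
EPISODES = 5000
ALPHA = 0.1
EPSILON = 0.2


class ToyDecPOMDP:
    """A tiny two-agent common-payoff problem, standing in for a Dec-POMDP benchmark."""

    def __init__(self):
        self.actions = [0, 1]

    def reset(self):
        # Hidden state is drawn once per episode and persists over the horizon.
        self.state = random.choice([0, 1])

    def step(self, a1, a2):
        # Joint reward: both agents must match the hidden state to earn the bonus.
        reward = 1.0 if (a1 == self.state and a2 == self.state) else -0.1
        # Each agent receives a private, noisy observation of the hidden state.
        o1 = self.state if random.random() < 0.85 else 1 - self.state
        o2 = self.state if random.random() < 0.85 else 1 - self.state
        return reward, o1, o2


def select_action(q, hist, actions, greedy):
    """Epsilon-greedy over Q-values indexed by (action-observation history, action)."""
    if not greedy and random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(hist, a)])


def learn_best_response(env, other_q, learner_is_first):
    """One phase: the learner runs finite-horizon Q-learning over its own histories
    while the other agent follows its current policy greedily (held fixed)."""
    q = defaultdict(float)
    for _ in range(EPISODES):
        env.reset()
        h_me, h_other = (), ()
        for t in range(HORIZON):
            a_me = select_action(q, h_me, env.actions, greedy=False)
            a_other = select_action(other_q, h_other, env.actions, greedy=True)
            a1, a2 = (a_me, a_other) if learner_is_first else (a_other, a_me)
            reward, o1, o2 = env.step(a1, a2)
            o_me, o_other = (o1, o2) if learner_is_first else (o2, o1)
            h_me_next = h_me + (a_me, o_me)
            # Undiscounted finite-horizon backup; bootstrap except at the last step.
            target = reward
            if t < HORIZON - 1:
                target += max(q[(h_me_next, a)] for a in env.actions)
            q[(h_me, a_me)] += ALPHA * (target - q[(h_me, a_me)])
            h_me = h_me_next
            h_other = h_other + (a_other, o_other)
    return q


if __name__ == "__main__":
    env = ToyDecPOMDP()
    q1, q2 = defaultdict(float), defaultdict(float)
    for _ in range(3):  # agents take turns computing best responses to each other
        q1 = learn_best_response(env, q2, learner_is_first=True)
        q2 = learn_best_response(env, q1, learner_is_first=False)
    print("Q-table entries learned by agent 1:", len(q1))
```

In this sketch the Q-table is indexed by full action-observation histories, so its size (and hence the number of "leaves" to be sampled) grows exponentially with the horizon, which is the scaling issue the abstract's sample-complexity result concerns.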
