Accelerating Training in Pommerman with Imitation and Reinforcement Learning

The Pommerman simulation was recently developed to mimic the classic Japanese game Bomberman, with a focus on competitive gameplay in a multi-agent setting. We focus on the 2$\times$2 team version of Pommerman, developed for a competition at NeurIPS 2018. Our methodology trains an agent initially through imitation learning on a noisy expert policy, followed by fine-tuning with the proximal policy optimization (PPO) reinforcement learning algorithm. The basic PPO approach is modified to ensure a stable transition from the imitation learning phase, using reward shaping, heuristic action filters, and curriculum learning. The proposed methodology beats heuristic and pure reinforcement learning baselines within a combined 100,000 training games, significantly faster than other non-tree-search methods in the literature. We present results against multiple agents provided by the developers of the simulation, including some that we have enhanced. We also include a sensitivity analysis over different parameters and highlight undesirable effects of some strategies that initially appear promising. Since Pommerman is a complex, competitive multi-agent environment, the strategies developed here provide insight into several real-world problems characterized by partial observability, decentralized execution (without communication), and very sparse and delayed rewards.
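
As a concrete illustration of the two-phase pipeline described above, the sketch below behaviour-clones a noisy expert and then fine-tunes with PPO's clipped surrogate objective, with a hook for a heuristic action filter. It is a minimal sketch under stated assumptions, not the authors' implementation: the observation encoding (`OBS_DIM`), the expert data, the advantages derived from shaped rewards, and the `safe` action mask are all stand-in placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACTIONS = 6    # Pommerman actions: stop, up, down, left, right, lay bomb
OBS_DIM = 372    # flattened observation size (assumed encoding)

class Policy(nn.Module):
    """Small actor-critic net; the real feature encoding is not specified here."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                                  nn.Linear(128, 128), nn.ReLU())
        self.pi = nn.Linear(128, N_ACTIONS)   # action logits
        self.v = nn.Linear(128, 1)            # value head used by PPO

    def forward(self, obs, action_mask=None):
        h = self.body(obs)
        logits = self.pi(h)
        if action_mask is not None:
            # Heuristic action filter: rule out actions the heuristic marks unsafe.
            logits = logits.masked_fill(~action_mask, float('-inf'))
        return logits, self.v(h).squeeze(-1)

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# --- Phase 1: imitation learning (behavioural cloning on noisy expert data) ---
obs = torch.randn(256, OBS_DIM)                  # stand-in for expert states
expert_a = torch.randint(0, N_ACTIONS, (256,))   # stand-in for expert actions
for _ in range(10):
    logits, _ = policy(obs)
    bc_loss = F.cross_entropy(logits, expert_a)  # maximise expert-action likelihood
    opt.zero_grad(); bc_loss.backward(); opt.step()

# --- Phase 2: PPO fine-tuning with the clipped surrogate objective ---
safe = torch.ones(256, N_ACTIONS, dtype=torch.bool)  # stand-in safety mask
with torch.no_grad():                                # "old" policy snapshot
    logits, _ = policy(obs, action_mask=safe)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    old_logp = dist.log_prob(actions)
adv = torch.randn(256)        # advantages from shaped rewards (e.g. via GAE)
returns = torch.randn(256)    # stand-in value targets
eps = 0.2                     # PPO clipping parameter
for _ in range(4):            # several epochs over the same rollout batch
    logits, value = policy(obs, action_mask=safe)
    dist = torch.distributions.Categorical(logits=logits)
    ratio = (dist.log_prob(actions) - old_logp).exp()
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv)
    loss = (-surrogate.mean()
            + 0.5 * F.mse_loss(value, returns)
            - 0.01 * dist.entropy().mean())
    opt.zero_grad(); loss.backward(); opt.step()
```

In a real run, the rollout tensors would come from the Pommerman environment rather than random stand-ins, and the imitation phase would warm-start the policy so that PPO begins from expert-like behaviour instead of a random initialization.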
