HogRider: Champion Agent of Microsoft Malmo Collaborative AI Challenge

Making optimal sequential decisions in complex multiagent systems, where agents may achieve higher utility through collaboration, remains an open challenge for self-interested agents. The Microsoft Malmo Collaborative AI Challenge (MCAC), designed to encourage research on problems in collaborative AI, takes the form of a Minecraft mini-game in which players may either work together to catch a pig or deviate from cooperation in pursuit of higher individual scores. Several characteristics, including complex interactions among agents, uncertainty, sequential decision making, and limited learning trials, make it extremely challenging to find effective strategies. We present HogRider, the champion agent of the 2017 MCAC among 81 teams from 26 countries. One key innovation of HogRider is a generalized agent-type hypothesis framework for identifying the behavior model of the other agent, which is demonstrated to be robust to observation uncertainty. A second key innovation is a novel Q-learning approach that learns an effective policy against each type of collaborating agent. Several ideas are proposed to adapt traditional Q-learning to the complexities of the challenge, including state-action abstraction to reduce the problem scale, a warm-start approach based on human reasoning to address the limited number of learning trials, and an active greedy strategy to balance exploration and exploitation. Challenge results show that HogRider outperforms all other teams by a significant margin, in terms of both optimality and stability.
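
To make the approach concrete, the sketch below illustrates the two building blocks described above, under stated assumptions: a Bayesian belief update over hypothesized collaborator types, and tabular Q-learning with heuristic warm-start initialization and a near-greedy exploration rule. The class names (TypeBelief, WarmStartQLearner), the near-best exploration margin, and the learning parameters are illustrative assumptions rather than the authors' implementation; the abstracted state encoding and the heuristic_q function would have to be supplied by the user.

```python
import random
from collections import defaultdict


class TypeBelief:
    """Bayesian belief over hypothesized collaborator types (illustrative only)."""

    def __init__(self, type_likelihoods):
        # type_likelihoods: dict mapping type name -> fn(state, action) -> probability
        self.likelihood = dict(type_likelihoods)
        n = len(self.likelihood)
        self.belief = {t: 1.0 / n for t in self.likelihood}

    def update(self, state, observed_action):
        # Multiply each type's belief by the likelihood of the observed action and
        # renormalize; the small floor keeps the update robust to noisy observations.
        for t in self.belief:
            self.belief[t] *= max(self.likelihood[t](state, observed_action), 1e-6)
        z = sum(self.belief.values())
        for t in self.belief:
            self.belief[t] /= z

    def most_likely(self):
        return max(self.belief, key=self.belief.get)


class WarmStartQLearner:
    """Tabular Q-learning over an abstracted state space (illustrative sketch)."""

    def __init__(self, actions, heuristic_q, alpha=0.2, gamma=0.95, epsilon=0.1):
        # heuristic_q(state, action) supplies human-reasoned initial values (warm start).
        self.q = defaultdict(float)
        self.actions = list(actions)
        self.heuristic_q = heuristic_q
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.seen = set()

    def value(self, state, action):
        # Lazily initialize unseen entries from the heuristic instead of zero.
        if (state, action) not in self.seen:
            self.q[(state, action)] = self.heuristic_q(state, action)
            self.seen.add((state, action))
        return self.q[(state, action)]

    def select_action(self, state):
        # "Active greedy" exploration: with probability epsilon, explore only among
        # actions whose estimates are close to the best one (margin is illustrative),
        # rather than uniformly at random.
        best = max(self.actions, key=lambda a: self.value(state, a))
        if random.random() < self.epsilon:
            near_best = [a for a in self.actions
                         if self.value(state, best) - self.value(state, a) < 1.0]
            return random.choice(near_best)
        return best

    def learn(self, state, action, reward, next_state):
        # Standard one-step Q-learning update toward the bootstrapped target.
        current = self.value(state, action)
        target = reward + self.gamma * max(self.value(next_state, a) for a in self.actions)
        self.q[(state, action)] = current + self.alpha * (target - current)
```

In such a setup, one would typically keep one WarmStartQLearner per hypothesized collaborator type and, at each step, act according to the learner matching TypeBelief.most_likely().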
