Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination

An AI agent should be able to coordinate with humans to solve tasks. We consider the problem of training a Reinforcement Learning (RL) agent without using any human data, i.e., in a zero-shot setting, so that it can collaborate with humans. Standard RL agents are trained through self-play. Unfortunately, such agents only learn to collaborate with themselves and typically perform poorly with unseen partners such as humans. How to train a robust agent for zero-shot coordination remains an open research question. Motivated by maximum entropy RL, we derive a centralized population entropy objective that facilitates learning a diverse population of agents, which is subsequently used to train a robust agent that can collaborate with unseen partners. The proposed method outperforms baseline methods, including self-play PPO, standard Population-Based Training (PBT), and trajectory-diversity-based PBT, in the popular Overcooked game environment. We also conduct online experiments with real humans, further demonstrating the efficacy of the method in the real world. A supplementary video showing the experimental results is available at https://youtu.be/Xh-FKD0AAKE.
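To make the idea concrete, the following is a minimal, hedged sketch of what such a centralized population entropy objective could look like; the notation (a population of n policies π_1, …, π_n, their uniform mixture π̄, and a temperature α) is illustrative and not taken verbatim from the paper. In the spirit of maximum entropy RL, each member of the population would maximize task reward plus the entropy of the population's mean policy:

\[
J(\pi_i) \;=\; \mathbb{E}_{\tau \sim \pi_i}\Big[\sum_{t} r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\bar{\pi}(\cdot \mid s_t)\big)\Big],
\qquad
\bar{\pi}(a \mid s) \;=\; \frac{1}{n} \sum_{j=1}^{n} \pi_j(a \mid s),
\]

where the shared entropy bonus \(\mathcal{H}(\bar{\pi})\) rewards each agent for taking actions the rest of the population uses rarely, pushing the population toward diverse behaviors before a single robust agent is trained against it.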
