Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination

An AI agent should be able to coordinate with humans to solve tasks. We consider the problem of training a Reinforcement Learning (RL) agent without using any human data, i.e., in a zero-shot setting, so that it can collaborate with humans. Standard RL agents are trained through self-play. Unfortunately, such agents only learn to collaborate with themselves and typically perform poorly with unseen partners such as humans. How to train a robust agent for zero-shot coordination remains an open research question. Motivated by maximum entropy RL, we derive a centralized population entropy objective that facilitates learning a diverse population of agents, which is subsequently used to train a robust agent that can collaborate with unseen partners. The proposed method outperforms baseline methods, including self-play PPO, standard Population-Based Training (PBT), and trajectory-diversity-based PBT, in the popular Overcooked game environment. We also conduct online experiments with real humans, further demonstrating the efficacy of the method in the real world. A supplementary video showing the experimental results is available at https://youtu.be/Xh-FKD0AAKE.
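To make the idea concrete, the following is a minimal, hedged sketch of what such a centralized population entropy objective could look like; the notation (a population of n policies π_1, …, π_n, their uniform mixture π̄, and a temperature α) is illustrative and not taken verbatim from the paper. In the spirit of maximum entropy RL, each member of the population would maximize task reward plus the entropy of the population's mean policy:

\[
J(\pi_i) \;=\; \mathbb{E}_{\tau \sim \pi_i}\Big[\sum_{t} r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\bar{\pi}(\cdot \mid s_t)\big)\Big],
\qquad
\bar{\pi}(a \mid s) \;=\; \frac{1}{n} \sum_{j=1}^{n} \pi_j(a \mid s),
\]

where the shared entropy bonus \(\mathcal{H}(\bar{\pi})\) rewards each agent for taking actions the rest of the population uses rarely, pushing the population toward diverse behaviors before a single robust agent is trained against it.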
