Language Instructed Reinforcement Learning for Human-AI Coordination

One of the fundamental quests of AI is to produce agents that coordinate well with humans. This problem is challenging, especially in domains that lack high-quality human behavioral data, because multi-agent reinforcement learning (RL) often converges to equilibria different from those that humans prefer. We propose a novel framework, instructRL, that enables humans to specify, through natural language instructions, what kinds of strategies they expect from their AI partners. We use pretrained large language models to generate a prior policy conditioned on the human instruction, and we use this prior to regularize the RL objective, leading the RL agent to converge to equilibria that are aligned with human preferences. We show that instructRL converges to human-like policies that satisfy the given instructions in a proof-of-concept environment as well as the challenging Hanabi benchmark. Finally, human evaluations in Hanabi show that knowing the language instruction significantly boosts human-AI coordination performance.
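
To make the regularized objective concrete, the sketch below is a minimal, hedged illustration (not the paper's exact implementation): a standard policy-gradient loss plus a KL penalty pulling the learned policy toward an instruction-conditioned LLM prior. The function name instruct_rl_loss, the prior_log_probs input, and the weight lam are all illustrative assumptions; in the actual method the prior would come from querying an LLM with the instruction and a text description of the state and candidate actions, machinery that is abstracted away here.

```python
import torch
import torch.nn.functional as F

def instruct_rl_loss(logits, actions, advantages, prior_log_probs, lam=0.1):
    """KL-regularized policy-gradient loss (illustrative sketch).

    logits:          agent policy logits, shape (B, num_actions)
    actions:         sampled action indices, shape (B,)
    advantages:      estimated advantages, shape (B,)
    prior_log_probs: log pi_prior(a|s) from the instruction-conditioned
                     LLM prior, shape (B, num_actions)
    lam:             regularization weight (hypothetical value)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Standard policy-gradient term on the sampled actions.
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(advantages * taken).mean()
    # KL(pi || pi_prior) pulls the learned policy toward the LLM prior,
    # steering RL to equilibria consistent with the instruction.
    kl = (log_probs.exp() * (log_probs - prior_log_probs)).sum(-1).mean()
    return pg_loss + lam * kl
```

Setting lam to zero recovers unregularized RL, while larger values trade task reward for closer adherence to the language-specified convention.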
