Evaluating Multimodal Interactive Agents

Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often correlate poorly with interactive evaluation. In this paper, we assess the merits of existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see a replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast, controlled, interpretable, and representative of naturalistic interactions. Altogether, the STS consolidates much of what is desirable across many of our standard evaluation metrics, allowing us to accelerate research progress towards producing agents that can interact naturally with humans.

Video: https://youtu.be/YR1TngGORGQ
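The STS ranking rule described above reduces to a simple aggregate: for each agent, compute the fraction of its recorded continuations that annotators marked as successful, then sort agents by that fraction. A minimal sketch follows; the `(agent, success)` record format and the function name `rank_agents` are illustrative assumptions, not the paper's actual data schema.

```python
from collections import defaultdict

def rank_agents(annotations):
    """Rank agents by the proportion of scenario continuations
    that human annotators marked as successful.

    `annotations` is an iterable of (agent_name, success) pairs,
    one per annotated continuation (hypothetical record format).
    Returns a list of (agent_name, success_rate) tuples sorted
    from best to worst.
    """
    totals = defaultdict(int)      # continuations annotated per agent
    successes = defaultdict(int)   # continuations marked as successes
    for agent, success in annotations:
        totals[agent] += 1
        if success:
            successes[agent] += 1
    scores = {agent: successes[agent] / totals[agent] for agent in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, an agent succeeding in 1 of 2 annotated continuations scores 0.5 and ranks below an agent succeeding in all of its continuations.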
