Evaluating Visual Conversational Agents via Cooperative Human-AI Games

As AI continues to advance, human-AI teams are inevitable. However, progress in AI is routinely measured in isolation, without a human in the loop. It is crucial to benchmark progress in AI not just in isolation, but also in terms of how it translates to helping humans perform certain tasks, i.e., the performance of human-AI teams. In this work, we design a cooperative game - GuessWhich - to measure human-AI team performance in the specific context of the AI being a visual conversational agent. GuessWhich involves live interaction between the human and the AI. The AI, which we call ALICE, is provided an image that is unseen by the human. Following a brief description of the image, the human questions ALICE about this secret image in order to identify it from a fixed pool of images. We measure the performance of the human-ALICE team by the number of guesses it takes the human to correctly identify the secret image after a fixed number of dialog rounds with ALICE. We compare the performance of human-ALICE teams for two versions of ALICE. Our human studies suggest a counterintuitive trend: while the AI literature shows that one version outperforms the other when paired with an AI questioner bot, we find that this improvement in AI-AI performance does not translate to improved human-AI performance. This suggests a mismatch between benchmarking AI in isolation and in the context of human-AI teams.
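
The evaluation protocol described above can be summarized as a game loop plus a single team metric. Below is a minimal Python sketch of that protocol, not the authors' released code: the names (play_guesswhich, human, alice, and their methods) are hypothetical interfaces, and the default number of dialog rounds and the guessing procedure are assumptions used only for illustration.

    # Illustrative sketch of one GuessWhich game (assumed interfaces, not the paper's code).
    def play_guesswhich(human, alice, secret_image, image_pool, num_rounds=9):
        """Run one game and return the number of guesses the human needed
        to identify the secret image (lower is better for the team)."""
        # ALICE sees the secret image; the human only receives a brief description.
        dialog = [alice.describe(secret_image)]

        # Fixed number of question-answer rounds between human and ALICE.
        for _ in range(num_rounds):
            question = human.ask(dialog, image_pool)
            answer = alice.answer(secret_image, dialog, question)
            dialog.extend([question, answer])

        # After the dialog, the human guesses from the pool until correct.
        num_guesses = 0
        remaining = list(image_pool)
        while remaining:
            guess = human.guess(dialog, remaining)
            num_guesses += 1
            if guess == secret_image:
                break
            remaining.remove(guess)
        return num_guesses

    def mean_guesses(per_game_guess_counts):
        """Team performance: average number of guesses across games."""
        return sum(per_game_guess_counts) / len(per_game_guess_counts)

Under this sketch, two versions of ALICE can be compared by running the same loop with each version against comparable pools of human players and comparing the resulting mean guess counts.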
