Towards Ecologically Valid Research on Language User Interfaces

Language User Interfaces (LUIs) could improve human-machine interaction for a wide variety of tasks, such as playing music, getting insights from databases, or instructing domestic robots. In contrast to traditional hand-crafted approaches, recent work attempts to build LUIs in a data-driven way using modern deep learning methods. To satisfy the data needs of such learning algorithms, researchers have constructed benchmarks that emphasize the quantity of collected data at the cost of its naturalness and relevance to real-world LUI use cases. As a consequence, research findings on such benchmarks might not be relevant for developing practical LUIs. The goal of this paper is to bootstrap the discussion around this issue, which we refer to as the benchmarks' low ecological validity. To this end, we describe what we deem an ideal methodology for machine learning research on LUIs and categorize five common ways in which recent benchmarks deviate from it. We give concrete examples of the five kinds of deviations and their consequences. Lastly, we offer a number of recommendations as to how to increase the ecological validity of machine learning research on LUIs.
