Multi-Modal Repairs of Conversational Breakdowns in Task-Oriented Dialogs

A major problem with task-oriented conversational agents is the lack of support for repairing conversational breakdowns. Prior studies have shown that current repair strategies for these errors are often ineffective for two reasons: (1) the system is not transparent about the state of its understanding of the user's utterance; and (2) the system has limited capability to understand the user's verbal attempts to repair natural language understanding errors. This paper introduces SOVITE, a new multi-modal interface that combines speech with direct manipulation to help users discover, identify the causes of, and recover from conversational breakdowns, using the GUIs of existing mobile apps as resources for grounding. SOVITE displays the system's understanding of user intents using GUI screenshots, allows users to refer to third-party apps and their GUI screens in conversation as inputs for intent disambiguation, and enables users to repair breakdowns through direct manipulation on these screenshots. The results of a remote user study in which 10 users applied SOVITE across 7 scenarios suggest that SOVITE's approach is usable and effective.
