ScratchThat: Supporting Command-Agnostic Speech Repair in Voice-Driven Assistants

Speech interfaces have become an increasingly popular input method for smartphone-based virtual assistants, smart speakers, and Internet of Things (IoT) devices. While they facilitate rapid and natural interaction in the form of voice commands, current speech interfaces lack natural methods for command correction. We present ScratchThat, a method for supporting command-agnostic speech repair in voice-driven assistants, suitable for enabling corrective functionality within third-party commands. Unlike existing speech repair methods, ScratchThat automatically infers query parameters and intelligently selects entities in a correction clause for editing. We conducted three evaluations to (1) elicit natural forms of speech repair in voice commands, (2) compare the interaction speed and NASA TLX scores of the system against existing voice-based correction methods, and (3) assess the accuracy of the ScratchThat algorithm. Our results show that (1) speech repair for voice commands differs from previous models of conversational speech repair, (2) command correction methods based on speech repair are significantly faster than other voice-based methods, and (3) the ScratchThat algorithm facilitates accurate command repair as rated by humans (77% accuracy) and machines (0.94 BLEU score). Finally, we present several ScratchThat use cases, which collectively demonstrate its utility across many applications.
