Predicting User Satisfaction with Intelligent Assistants

The use of voice-controlled intelligent personal assistants on mobile devices, such as Microsoft's Cortana, Google Now, and Apple's Siri, is growing rapidly. These assistants significantly change how users interact with search systems, not only through the use of voice control and touch gestures, but also through the dialogue-style nature of the interactions and their ability to preserve context across queries. Predicting the success or failure of such search dialogues is a new problem, and an important one for evaluating and further improving intelligent assistants. While clicks in web search have been used extensively to infer user satisfaction, they are less informative in search dialogues because they are partially replaced by voice control, direct and voice answers, and touch gestures. In this paper, we propose an automatic method to predict user satisfaction with intelligent assistants that exploits all of these interaction signals, including voice commands and physical touch gestures on the device. First, we conduct an extensive user study to measure user satisfaction with intelligent assistants while simultaneously recording all user interactions. Second, we show that the dialogue style of interaction makes it necessary to evaluate the user experience at the overall task level rather than at the query level. Third, we train a model to predict user satisfaction and find that interaction signals capturing the user's reading patterns have a high impact: when including all available interaction signals, we improve the prediction accuracy of user satisfaction from 71% to 81% over a baseline that uses only click and query features.
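The prediction setup described above can be sketched as a supervised classifier over per-task interaction features. The sketch below is illustrative only: the feature names, synthetic data, and the choice of scikit-learn's gradient-boosted trees are assumptions (gradient boosting is a plausible choice for this kind of tabular interaction data), not the paper's actual feature set or implementation.

```python
# Hypothetical sketch: predict task-level satisfaction from interaction signals.
# All features and labels here are synthetic stand-ins, not the study's data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_tasks = 1000
# Illustrative per-task signals: clicks, query reformulations, voice commands,
# touch gestures, and a proxy for time spent reading results.
X = rng.random((n_tasks, 5))
# Synthetic label: satisfaction loosely driven by longer reading time and
# fewer reformulations, plus noise.
y = (0.8 * X[:, 4] - 0.4 * X[:, 1]
     + 0.2 * rng.standard_normal(n_tasks) > 0.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

In this toy setting the classifier recovers the planted dependence on the reading-time feature, mirroring the paper's finding that reading-pattern signals carry strong predictive weight.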
