Evaluating Variable-Length Multiple-Option Lists in Chatbots and Mobile Search

In recent years, the proliferation of smart mobile devices has led to the gradual integration of search functionality within mobile platforms. This has created an incentive to move away from the "ten blue links" metaphor, as mobile users are less likely to click on the links, expecting instead to get the answer directly from the snippets. In turn, this has revived interest in Question Answering. Then along came chatbots, conversational systems, and messaging platforms, where user needs can often be better served by the system asking follow-up questions in order to better understand the user's intent. While typically a user would expect a single response to any utterance, a system could also return multiple options for the user to select from, based on different system interpretations of the user's intent. However, this possibility should not be overused, as the practice could confuse and/or annoy the user. How to produce good variable-length lists, given the conflicting objectives of staying short while maximizing the likelihood that a correct answer is included in the list, is an underexplored problem, and it is also unclear how to evaluate a system that attempts this. Here we aim to bridge this gap. In particular, we define some necessary and some optional properties that an evaluation measure fit for this purpose should have. We further show that existing evaluation measures from the IR tradition are not entirely suitable for this setup, and we propose novel evaluation measures that address it satisfactorily.
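The abstract does not spell out the proposed measures, but the stated trade-off (keep the option list short, yet include a correct answer) can be made concrete with a toy score. The following is a minimal sketch under assumed semantics, not one of the paper's measures: the function name option_list_score and its geometric length penalty are illustrative choices.

    def option_list_score(options, correct_answer, penalty=0.8):
        """Score one variable-length option list against a known correct answer.

        A list containing the correct answer earns a reward that decays
        geometrically with list length: a singleton correct list scores 1.0,
        longer lists score progressively less, and a list without the correct
        answer scores 0.0. The geometric penalty is an illustrative assumption,
        not a measure taken from the paper.
        """
        if correct_answer in options:
            return penalty ** (len(options) - 1)
        return 0.0

    # Example: three system outputs for the same user utterance.
    print(option_list_score(["book a flight"], "book a flight"))                  # 1.0
    print(option_list_score(["book a flight", "book a hotel"], "book a flight"))  # 0.8
    print(option_list_score(["book a hotel", "rent a car"], "book a flight"))     # 0.0

Any score of this shape rewards a system for asking only when genuinely uncertain: adding options is worthwhile only when the extra chance of covering the correct answer outweighs the length penalty, which mirrors the conflicting objectives described above.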
