A Methodology for Evaluating Interaction Strategies of Task-Oriented Conversational Agents

In task-oriented conversational agents, attention has usually been devoted to assessing task effectiveness rather than to how the task is achieved. However, conversational agents are moving towards more complex and human-like interaction capabilities (e.g. the ability to use a formal/informal register, or to show empathetic behavior), for which standard evaluation methodologies may not suffice. In this paper, we provide a novel methodology to assess, in a fully controlled way, the impact of an agent's interaction strategies on the quality of experience. The methodology is based on a within-subjects design in which two slightly different transcripts of the same interaction with a conversational agent are presented to the user. Through a series of pilot experiments we show that this methodology enables fast and inexpensive experimentation and evaluation, focusing on aspects that are overlooked by current methods.
