How Was Your Day? Evaluating a Conversational Companion

The "How Was Your Day" (HWYD) companion is an embodied conversational agent that engages in free-form dialogue about issues arising in a typical work day. The open-ended nature of these interactions requires new models of evaluation. Here, we describe a paradigm and methodology for evaluating the main aspects of such functionality, in conjunction with overall system behavior, with respect to three parameters: functional ability (i.e., does it do the "right" thing conversationally), content (i.e., does it respond appropriately to the semantic context), and emotional behavior (i.e., given the emotional input from the user, does it respond in an emotionally appropriate way). We demonstrate the utility of our evaluation paradigm as a method both for grading current system performance and for targeting areas for particular performance review. We show a correlation between, for example, automatic speech recognition performance and overall system performance (as is expected in systems of this type); beyond this, we show how individual utterances or responses, marked as positive or negative, characterize system performance, and demonstrate how our combined evaluation approach highlights issues (both positive and negative) in the companion system's interaction behavior.
