Predictive Engagement: An Efficient Metric for Automatic Evaluation of Open-Domain Dialogue Systems

User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement, using heuristically constructed features such as the number of turns and the total duration of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, {\em predictive engagement}, for the automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human annotators show high agreement when assessing utterance-level engagement scores, and (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we show that utterance-level engagement scores can be learned from data, and that incorporating them improves the correlation of automatic evaluation metrics for open-domain dialogue systems with human judgements. This suggests that predictive engagement can also serve as real-time feedback for training better dialogue models.
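To make the described pipeline concrete, the sketch below illustrates the two steps the abstract outlines: per-utterance engagement scores are aggregated into a conversation-level score, and that score is combined with an existing automatic metric. This is a minimal sketch rather than the paper's implementation: the utterance-level scorer is a hypothetical stand-in for a model learned from data, simple averaging is assumed as the aggregation, and the mixing weight `alpha` in `blended_score` is an assumption.

```python
# Minimal sketch (not the authors' implementation): utterance-level engagement
# is assumed to come from some learned scorer, plain averaging is used as the
# aggregation, and engagement is blended with an existing relevance-style
# metric via a hypothetical mixing weight `alpha`.

from statistics import mean
from typing import Callable, List, Tuple


def conversation_engagement(
    turns: List[Tuple[str, str]],
    utterance_engagement: Callable[[str, str], float],
) -> float:
    """Average per-utterance engagement scores over (query, response) pairs
    to obtain a single conversation-level engagement score."""
    return mean(utterance_engagement(q, r) for q, r in turns)


def blended_score(relevance: float, engagement: float, alpha: float = 0.5) -> float:
    """Combine an existing automatic metric (assumed to lie in [0, 1]) with a
    predictive-engagement score; `alpha` is a hypothetical mixing weight."""
    return alpha * relevance + (1.0 - alpha) * engagement


if __name__ == "__main__":
    # Toy stand-in for a trained utterance-level engagement model.
    def toy_scorer(query: str, response: str) -> float:
        return min(1.0, len(response.split()) / 10.0)

    dialogue = [
        ("How are you?", "Great, I just got back from a hiking trip!"),
        ("Where did you go?", "Yosemite. The views were incredible."),
    ]
    eng = conversation_engagement(dialogue, toy_scorer)
    print(f"conversation-level engagement: {eng:.2f}")
    print(f"blended with relevance 0.7:    {blended_score(0.7, eng):.2f}")
```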
