Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems

Evaluation is crucial in the development process of task-oriented dialogue systems. As an evaluation method, user simulation allows us to tackle issues such as scalability and cost-efficiency, making it a viable choice for large-scale automatic evaluation. To help build a human-like user simulator that can measure the quality of a dialogue, we propose the following task: simulating user satisfaction for the evaluation of task-oriented dialogue systems. The purpose of the task is to increase the evaluation power of user simulations and to make the simulation more human-like. To overcome a lack of annotated data, we propose a user satisfaction annotation dataset, USS, that includes 6,800 dialogues sampled from multiple domains, spanning real-world e-commerce dialogues, task-oriented dialogues constructed through Wizard-of-Oz experiments, and movie recommendation dialogues. All user utterances in those dialogues, as well as the dialogues themselves, have been labeled based on a 5-level satisfaction scale. We also share three baseline methods for user satisfaction prediction and action prediction tasks. Experiments conducted on the USS dataset suggest that distributed representations outperform feature-based methods. A model based on hierarchical GRUs achieves the best performance in in-domain user satisfaction prediction, while a BERT-based model has better cross-domain generalization ability.

[1]  Marilyn A. Walker,et al.  PARADISE: A Framework for Evaluating Spoken Dialogue Agents , 1997, ACL.

[2]  Kazuya Takeda,et al.  Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System , 2010, LREC.

[3]  Zhaochun Ren,et al.  Hierarchical Variational Memory Network for Dialogue Generation , 2018, WWW.

[4]  Milica Gasic,et al.  POMDP-Based Statistical Spoken Dialog Systems: A Review , 2013, Proceedings of the IEEE.

[5]  Milica Gasic,et al.  Real User Evaluation of Spoken Dialogue Systems Using Amazon Mechanical Turk , 2011, INTERSPEECH.

[6]  Meng Chen,et al.  The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service , 2020, LREC.

[7]  Zheng-Yu Niu,et al.  Conversational Graph Grounded Policy Learning for Open-Domain Conversation Generation , 2020, ACL.

[8]  Xiaozhong Liu,et al.  Time to Transfer: Predicting and Evaluating Machine-Human Chatting Handoff , 2020, AAAI.

[9]  Wolfgang Minker,et al.  Recurrent Neural Network Interaction Quality Estimation , 2016, IWSDS.

[10]  Homa B. Hashemi,et al.  Query Intent Detection using Convolutional Neural Networks , 2016 .

[11]  Wolfgang Minker,et al.  On Quality Ratings for Spoken Dialogue Systems – Experts vs. Users , 2013, NAACL.

[12]  Sebastian Möller,et al.  Modeling User Satisfaction with Hidden Markov Models , 2009, SIGDIAL Conference.

[13]  Kallirroi Georgila,et al.  Learning user simulations for information state update dialogue systems , 2005, INTERSPEECH.

[14]  Stefan Ultes,et al.  Interaction Quality: Assessing the quality of ongoing spoken dialog interaction by experts - And how it relates to user satisfaction , 2015, Speech Commun..

[15]  H. Cuayahuitl,et al.  Human-computer dialogue simulation using hidden Markov models , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[16]  Zhoujun Li,et al.  Building Task-Oriented Dialogue Systems for Online Shopping , 2017, AAAI.

[17]  Krisztian Balog,et al.  Evaluating Conversational Recommender Systems via User Simulation , 2020, KDD.

[18]  Diyi Yang,et al.  Hierarchical Attention Networks for Document Classification , 2016, NAACL.

[19]  Jing He,et al.  A Sequence-to-Sequence Model for User Simulation in Spoken Dialogue Systems , 2016, INTERSPEECH.

[20]  Mihail Eric,et al.  MultiWOZ 2. , 2019 .

[21]  Zhaochun Ren,et al.  Explicit State Tracking with Semi-Supervisionfor Neural Dialogue Generation , 2018, CIKM.

[22]  Lina M. Rojas Barahona Is the User Enjoying the Conversation? A Case Study on the Impact on the Reward Function , 2021, ArXiv.

[23]  Quoc V. Le,et al.  A Neural Conversational Model , 2015, ArXiv.

[24]  Weinan Zhang,et al.  A Compare Aggregate Transformer for Understanding Document-grounded Dialogue , 2020, FINDINGS.

[25]  Ryuichiro Higashinaka,et al.  Issues in Predicting User Satisfaction Transitions in Dialogues: Individual Differences, Evaluation Criteria, and Prediction Models , 2010, IWSDS.

[26]  Charu C. Aggarwal,et al.  Mining Text Data , 2012, Springer US.

[27]  Iñigo Casanueva,et al.  Neural User Simulation for Corpus-based Policy Optimisation of Spoken Dialogue Systems , 2018, SIGDIAL Conference.

[28]  Luísa Coheur,et al.  Luke, I am Your Father: Dealing with Out-of-Domain Requests by Using Movies Subtitles , 2014, IVA.

[29]  M. de Rijke,et al.  Conversations Powered by Cross-Lingual Knowledge , 2021, SIGIR.

[30]  David Vandyke,et al.  Multi-domain Dialog State Tracking using Recurrent Neural Networks , 2015, ACL.

[31]  Arantxa Otegi,et al.  Survey on evaluation methods for dialogue systems , 2019, Artificial Intelligence Review.

[32]  Lazaros Polymenakos,et al.  Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation , 2019, ArXiv.

[33]  Michael R. Lyu,et al.  HiGRU: Hierarchical Gated Recurrent Units for Utterance-Level Emotion Recognition , 2019, NAACL.

[34]  Maxine Eskénazi,et al.  Spoken Dialog Challenge 2010: Comparison of Live and Control Test Results , 2011, SIGDIAL Conference.

[35]  Roberto Pieraccini,et al.  User modeling for spoken dialogue system evaluation , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[36]  Jianfeng Gao,et al.  A Diversity-Promoting Objective Function for Neural Conversation Models , 2015, NAACL.

[37]  Lazaros Polymenakos,et al.  Joint Turn and Dialogue level User Satisfaction Estimation on Mulit-Domain Conversations , 2020, FINDINGS.

[38]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[39]  Joelle Pineau,et al.  Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models , 2015, AAAI.

[40]  Xiangnan He,et al.  Interactive Path Reasoning on Graph for Conversational Recommendation , 2020, KDD.

[41]  Wolfgang Minker,et al.  A Parameterized and Annotated Spoken Dialog Corpus of the CMU Let’s Go Bus Information System , 2012, LREC.

[42]  Li Chen,et al.  Predicting User Intents and Satisfaction with Dialogue-based Conversational Recommendations , 2020, UMAP.

[43]  Raghav Gupta,et al.  Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset , 2020, AAAI.

[44]  Haizhou Li,et al.  IRIS: a Chat-oriented Dialogue System based on the Vector Space Model , 2012, ACL.

[45]  Min-Yen Kan,et al.  Sequicity: Simplifying Task-oriented Dialogue Systems with Single Sequence-to-Sequence Architectures , 2018, ACL.

[46]  Xiaoyan Zhu,et al.  Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory , 2017, AAAI.

[47]  Jianfeng Gao,et al.  A Persona-Based Neural Conversation Model , 2016, ACL.

[48]  Filip Radlinski,et al.  Coached Conversational Preference Elicitation: A Case Study in Understanding Movie Preferences , 2019, SIGdial.

[49]  Hui Ye,et al.  Agenda-Based User Simulation for Bootstrapping a POMDP Dialogue System , 2007, NAACL.

[50]  Geoffrey Zweig,et al.  Recurrent neural networks for language understanding , 2013, INTERSPEECH.

[51]  Lori Lamel,et al.  The LIMSI ARISE system , 2000, Speech Commun..

[52]  Tsung-Hsien Wen,et al.  Neural Belief Tracker: Data-Driven Dialogue State Tracking , 2016, ACL.

[53]  M. de Rijke,et al.  DukeNet: A Dual Knowledge Interaction Network for Knowledge-Grounded Conversation , 2020, SIGIR.

[54]  David Vandyke,et al.  A Network-based End-to-End Trainable Task-oriented Dialogue System , 2016, EACL.

[55]  Oliver Lemon,et al.  Data-Driven Methods for Adaptive Spoken Dialogue Systems , 2012, Springer New York.

[56]  Konrad Scheffler,et al.  Probabilistic simulation of human-machine dialogues , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).