Paper information - Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems

Evaluation is crucial in the development process of task-oriented dialogue systems. As an evaluation method, user simulation allows us to tackle issues such as scalability and cost-efficiency, making it a viable choice for large-scale automatic evaluation. To help build a human-like user simulator that can measure the quality of a dialogue, we propose the following task: simulating user satisfaction for the evaluation of task-oriented dialogue systems. The purpose of the task is to increase the evaluation power of user simulations and to make the simulation more human-like. To overcome a lack of annotated data, we propose a user satisfaction annotation dataset, USS, that includes 6,800 dialogues sampled from multiple domains, spanning real-world e-commerce dialogues, task-oriented dialogues constructed through Wizard-of-Oz experiments, and movie recommendation dialogues. All user utterances in those dialogues, as well as the dialogues themselves, have been labeled based on a 5-level satisfaction scale. We also share three baseline methods for user satisfaction prediction and action prediction tasks. Experiments conducted on the USS dataset suggest that distributed representations outperform feature-based methods. A model based on hierarchical GRUs achieves the best performance in in-domain user satisfaction prediction, while a BERT-based model has better cross-domain generalization ability.
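The annotation scheme described above (a 5-level satisfaction label on every user utterance plus one on the dialogue as a whole) can be pictured with a small data structure. The following is only an illustrative sketch: the class names `Turn` and `Dialogue`, the field layout, the example utterances, and the majority-label helper are assumptions made here for illustration, not the USS file format or any of the paper's baseline methods.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    speaker: str                        # "user" or "system" (assumed convention)
    text: str
    satisfaction: Optional[int] = None  # 1-5, present on user turns only (assumed layout)


@dataclass
class Dialogue:
    turns: List[Turn]
    satisfaction: int                   # dialogue-level label on the same 1-5 scale

    def user_labels(self) -> List[int]:
        """Return the satisfaction labels of the user turns, in order."""
        return [t.satisfaction for t in self.turns
                if t.speaker == "user" and t.satisfaction is not None]


def validate(dialogue: Dialogue) -> None:
    """Raise ValueError if any label falls outside the 5-level scale."""
    for label in dialogue.user_labels() + [dialogue.satisfaction]:
        if not 1 <= label <= 5:
            raise ValueError(f"label {label} is outside the 1-5 scale")


def majority_label(dialogues: List[Dialogue]) -> int:
    """Most frequent utterance-level label; a trivial reference point, not a paper baseline."""
    counts = Counter(label for d in dialogues for label in d.user_labels())
    return counts.most_common(1)[0][0]


# Invented toy data, for illustration only.
dialogues = [
    Dialogue(
        turns=[
            Turn("user", "I need a hotel in the centre.", 3),
            Turn("system", "Sure, which price range?"),
            Turn("user", "That is not what I asked for.", 2),
        ],
        satisfaction=2,
    ),
    Dialogue(
        turns=[
            Turn("user", "Book a table for two tonight.", 3),
            Turn("system", "Your table is booked."),
            Turn("user", "Great, thanks!", 4),
        ],
        satisfaction=4,
    ),
]

for d in dialogues:
    validate(d)

print(majority_label(dialogues))
```

A satisfaction predictor in this setting would take the turns up to and including a given user utterance and output that utterance's label; the majority label only gives a floor against which such a model could be compared.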
