Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems

Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation. Such evaluation is often either limited to single turns or very time-intensive. As an alternative, user simulators that mimic user behavior allow us to consider a broad set of user goals and to generate human-like conversations for simulated evaluation. Employing existing user simulators to evaluate TDSs is challenging because they are primarily designed to optimize dialogue policies for TDSs and have limited evaluation capability. Moreover, the evaluation of user simulators is itself an open challenge. In this work, we propose a metaphorical user simulator for end-to-end TDS evaluation. We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities. Our user simulator constructs a metaphorical user model that assists the simulator in reasoning by referring to prior knowledge when encountering new items. We estimate the quality of simulators by checking the simulated interactions between simulators and variants. Our experiments are conducted using three TDS datasets. The metaphorical user simulator demonstrates better consistency with manual evaluation than an agenda-based simulator and a Seq2seq model on all three datasets; our tester framework demonstrates efficiency, and our approach demonstrates better generalization and scalability.
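
The tester-based idea of judging a simulator by how it separates system variants can be illustrated with a minimal sketch. It is not the paper's implementation: the variant names, the scores, and the `pairwise_agreement` helper are all hypothetical, and pairwise rank agreement is only one plausible way to measure consistency with manual evaluation.

```python
from itertools import combinations


def pairwise_agreement(simulator_scores, human_scores):
    """Fraction of variant pairs that both score sets order the same way.

    Both arguments map a variant name to a score (higher is better).
    Pairs that the human scores tie on are skipped.
    """
    variants = sorted(simulator_scores)
    concordant = 0
    total = 0
    for a, b in combinations(variants, 2):
        sim_diff = simulator_scores[a] - simulator_scores[b]
        human_diff = human_scores[a] - human_scores[b]
        if human_diff == 0:
            continue  # humans see no difference, so there is no order to match
        total += 1
        if sim_diff * human_diff > 0:
            concordant += 1
    return concordant / total if total else 0.0


# Hypothetical success rates for three system variants (assumed data).
human = {"full": 0.9, "no_context": 0.7, "no_recommender": 0.5}
sim_a = {"full": 0.85, "no_context": 0.6, "no_recommender": 0.4}
sim_b = {"full": 0.6, "no_context": 0.7, "no_recommender": 0.5}

print(pairwise_agreement(sim_a, human))
print(pairwise_agreement(sim_b, human))
```

In this toy setting, a simulator whose interactions rank the variants in the same order as human judges scores higher, which is the sense in which a simulator would be called more consistent with manual evaluation.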
