Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation

There is growing interest in developing goal-oriented dialog systems that help users accomplish complex tasks through multi-turn conversations. Although many methods have been devised to evaluate and improve the performance of individual dialog components, there is a lack of comprehensive empirical study of how those components contribute to the overall performance of a dialog system. In this paper, we perform a system-wise evaluation and present an empirical analysis of dialog systems built from different combinations of modules and training settings. Our results show that (1) a pipeline dialog system trained with fine-grained supervision signals at the component level often outperforms systems built on joint or end-to-end models trained on coarse-grained labels, (2) component-wise, single-turn evaluation results are not always consistent with the overall performance of a dialog system, and (3) despite the discrepancy between simulators and human users, simulated evaluation remains a valid alternative to costly human evaluation, especially in the early stages of development.
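
To make the pipeline architecture and the system-wise evaluation protocol concrete, below is a minimal, hypothetical Python sketch (not the paper's implementation): a pipeline agent composed of NLU, DST, policy, and NLG modules, evaluated by task success rate against a toy user simulator. All names here (PipelineAgent, ToyUserSimulator, the toy_* stubs) are illustrative stand-ins introduced for this sketch.

```python
# Hypothetical sketch of a pipeline dialog system and a system-wise
# (simulated) evaluation loop. Toy domain: booking a restaurant by cuisine.


class PipelineAgent:
    """Modular system: NLU -> DST -> Policy -> NLG. Each module can be
    trained with its own fine-grained supervision (dialog acts, belief states)."""

    def __init__(self, nlu, dst, policy, nlg):
        self.nlu, self.dst, self.policy, self.nlg = nlu, dst, policy, nlg
        self.state = {}

    def respond(self, user_utterance):
        user_acts = self.nlu(user_utterance)          # text -> dialog acts
        self.state = self.dst(self.state, user_acts)  # acts -> belief state
        system_acts = self.policy(self.state)         # state -> system acts
        return self.nlg(system_acts)                  # acts -> response text


# --- toy stubs standing in for trained NLU / DST / policy / NLG modules ----
def toy_nlu(utterance):
    return [("inform", "food", w) for w in ("italian", "chinese") if w in utterance]

def toy_dst(state, user_acts):
    return {**state, **{slot: value for _, slot, value in user_acts}}

def toy_policy(state):
    return [("book", "food", state["food"])] if "food" in state else [("request", "food", "?")]

def toy_nlg(system_acts):
    intent, _, value = system_acts[0]
    return f"Booked a {value} restaurant." if intent == "book" else "What food do you like?"


# --- system-wise evaluation against a (toy) user simulator ------------------
class ToyUserSimulator:
    """States its goal cuisine, then checks whether the booking matches."""

    def __init__(self, goal_food):
        self.goal_food, self.success = goal_food, False

    def step(self, system_response):
        self.success = self.goal_food in system_response.lower()
        return f"I want {self.goal_food} food."


def simulated_success_rate(make_agent, goals, max_turns=5):
    """Fraction of user goals completed within the turn budget."""
    wins = 0
    for goal in goals:
        agent, user = make_agent(), ToyUserSimulator(goal)
        system_response = ""
        for _ in range(max_turns):
            user_utterance = user.step(system_response)
            if user.success:
                break
            system_response = agent.respond(user_utterance)
        wins += user.success
    return wins / len(goals)


if __name__ == "__main__":
    goals = ["italian", "chinese"]
    rate = simulated_success_rate(
        lambda: PipelineAgent(toy_nlu, toy_dst, toy_policy, toy_nlg), goals)
    print(f"simulated task success rate: {rate:.2f}")
```

In a real setup, each toy_* stub would be replaced by a trained component (for example, a BERT-based NLU or a rule-based policy) and the toy simulator by an agenda-based user model; the joint and end-to-end systems compared in the paper instead map the dialog context to a response with a single model trained on coarse-grained labels, which is what the system-wise evaluation contrasts against.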
