CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance

Recent neural models that extend the pretrain-then-finetune paradigm continue to achieve new state-of-the-art results on joint goal accuracy (JGA) for dialogue state tracking (DST) benchmarks. However, we call their robustness into question, as they show sharp drops in JGA for conversations containing utterances or dialogue flows with realistic perturbations. Inspired by CheckList (Ribeiro et al., 2020), we design a collection of metrics called CheckDST that facilitate comparisons of DST models along comprehensive dimensions of robustness by testing well-known weaknesses with augmented test sets. We evaluate recent DST models with CheckDST and argue that models should be assessed more holistically rather than pursuing state-of-the-art on JGA, since a higher JGA does not guarantee better overall robustness. We find that span-based classification models are resilient to unseen named entities but not robust to language variety, whereas those based on autoregressive language models generalize better to language variety but tend to memorize named entities and often hallucinate. Due to their respective weaknesses, neither approach is yet suitable for real-world deployment. We believe CheckDST is a useful guide for future research to develop task-oriented dialogue models that embody the strengths of various methods.
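At its core, CheckDST compares a model's joint goal accuracy on an original test set against its predictions on label-preserving augmentations of the same dialogues (e.g., paraphrases or named entity swaps). The following is a minimal sketch of that comparison, assuming dialogue states are flat slot-value dictionaries; the function names and example data are illustrative and are not taken from the CheckDST codebase.

```python
# Illustrative sketch of CheckDST-style robustness metrics.
# All names here are hypothetical; the actual implementation may differ.

from typing import Dict, List

DialogueState = Dict[str, str]  # slot name -> value

def joint_goal_accuracy(preds: List[DialogueState],
                        golds: List[DialogueState]) -> float:
    """Fraction of turns where the predicted state exactly matches the gold state."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def consistency(orig_preds: List[DialogueState],
                perturbed_preds: List[DialogueState]) -> float:
    """Fraction of turns where the prediction is unchanged under a
    label-preserving perturbation (e.g., a paraphrase of the utterance)."""
    assert len(orig_preds) == len(perturbed_preds)
    return sum(o == p for o, p in zip(orig_preds, perturbed_preds)) / len(orig_preds)

# Example: a model that is perfect on the original test set but flips
# one of its predictions when the utterance is paraphrased.
gold = [{"hotel-area": "north"}, {"train-day": "monday"}]
orig = [{"hotel-area": "north"}, {"train-day": "monday"}]
para = [{"hotel-area": "north"}, {"train-day": "tuesday"}]

print(joint_goal_accuracy(orig, gold))  # 1.0 on the original set
print(joint_goal_accuracy(para, gold))  # 0.5 on the paraphrased set
print(consistency(orig, para))          # 0.5 prediction consistency
```

Under this view, a model can score perfectly on the original test set while its consistency under perturbation exposes exactly the robustness gaps described above, which is why JGA alone is an insufficient measure.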

[1] Fabrizio Silvestri et al. How Decoding Strategies Affect the Verifiability of Generated Text, 2020, Findings of EMNLP.

[2] Xifeng Yan et al. CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers, 2020, ArXiv.

[3] Jianfeng Gao et al. Few-shot Natural Language Generation for Task-Oriented Dialog, 2020, Findings of EMNLP.

[4] Alborz Geramifard et al. Annotation Inconsistency and Entity Bias in MultiWOZ, 2021, SIGDIAL.

[5] Percy Liang et al. Know What You Don't Know: Unanswerable Questions for SQuAD, 2018, ACL.

[6] Hongguang Li et al. Robustness Testing of Language Understanding in Task-Oriented Dialog, 2020, ACL.

[7] Quoc V. Le et al. Finetuned Language Models Are Zero-Shot Learners, 2021, ArXiv.

[8] Ryan McDonald et al. On Faithfulness and Factuality in Abstractive Summarization, 2020, ACL.

[9] Yejin Choi et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.

[10] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[11] Maxine Eskénazi et al. Structured Fusion Networks for Dialog, 2019, SIGDIAL.

[12] Qi Liu et al. Multi-Task Self-Supervised Learning for Disfluency Detection, 2019, AAAI.

[13] Yongbin Li et al. Preview, Attend and Review: Schema-Aware Curriculum Learning for Multi-Domain Dialogue State Tracking, 2021, ACL.

[14] John J. Godfrey et al. SWITCHBOARD: Telephone Speech Corpus for Research and Development, 1992, ICASSP.

[15] Jason Weston et al. ParlAI: A Dialog Research Software Platform, 2017, EMNLP.

[16] Alborz Geramifard et al. DAIR: Data Augmented Invariant Regularization, 2021, ArXiv.

[17] Omer Levy et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[18] Danqi Chen et al. CoQA: A Conversational Question Answering Challenge, 2018, TACL.

[19] Jianfeng Gao et al. RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems, 2020, ACL.

[20] Richard Socher et al. A Simple Language Model for Task-Oriented Dialogue, 2020, NeurIPS.

[21] Baolin Peng et al. Soloist: Building Task Bots at Scale with Transfer Learning and Machine Teaching, 2021, TACL.

[22] Hector J. Levesque et al. The Winograd Schema Challenge, 2011, AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning.

[23] Hongbo Zhang et al. Quora Question Pairs, 2017.

[24] Raghav Gupta et al. Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset, 2020, AAAI.

[25] Chris Brockett et al. Automatically Constructing a Corpus of Sentential Paraphrases, 2005, IJCNLP.

[26] Sonal Gupta et al. Muppet: Massive Multi-task Representations with Pre-Finetuning, 2021, EMNLP.

[27] Erik Nijkamp et al. Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation, 2020, ACL.

[28] Alexander M. Rush et al. Multitask Prompted Training Enables Zero-Shot Task Generalization, 2021, ArXiv.

[29] Nanyun Peng et al. Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings, 2019, Workshop on Methods for Optimizing and Evaluating Neural Language Generation.

[30] Ryuichi Takanobu et al. MultiWOZ 2.3: A Multi-domain Task-Oriented Dialogue Dataset Enhanced with Annotation Corrections and Co-Reference Annotation, 2020, NLPCC.

[31] Sameer Singh et al. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, 2020, ACL.

[32] Omer Levy et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.

[33] Zhijian Ou et al. Task-Oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context, 2019, AAAI.

[34] Nurul Lubis et al. TripPy: A Triple Copy Strategy for Value Independent Neural Dialog State Tracking, 2020, SIGDIAL.

[35] Dilek Z. Hakkani-Tür et al. DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue, 2020, ArXiv.

[36] Omer Levy et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.

[37] Paul A. Crook et al. Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue, 2021, ArXiv.

[38] Salim Roukos et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[39] Mihail Eric et al. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines, 2019, ArXiv.

[40] Richard Socher et al. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning, 2018, ArXiv.

[41] Elman Mansimov et al. Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System, 2021, ACL.