Know Thy Strengths: Comprehensive Dialogue State Tracking Diagnostics

Recent works revealing the vulnerability of dialogue state tracking (DST) models to distributional shifts have made holistic comparisons of robustness and qualitative analyses increasingly important for understanding their relative performance. We present our findings from standardized and comprehensive DST diagnoses, which have previously been sparse and uncoordinated, using our toolkit, CheckDST, a collection of robustness tests and failure mode analytics. We discover that different classes of DST models have clear strengths and weaknesses: generation models are more promising for handling language variety, while span-based classification models are more robust to unseen entities. Prompted by this discovery, we also compare checkpoints of the same model and find that the standard practice of selecting checkpoints by validation loss/accuracy is prone to overfitting and that each model class has distinct failure patterns. Lastly, we demonstrate how our diagnoses motivate a pre-finetuning procedure on non-dialogue data that offers comprehensive improvements to generation models by alleviating the impact of distributional shifts through transfer learning.
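
To make the notion of a robustness test concrete, the sketch below shows one way such a diagnostic could be computed: a DST model's predicted state on an original user turn is compared against its prediction on a perturbed version of the same turn (e.g., a paraphrase or an entity swap), and the fraction of unchanged predictions is reported. This is a minimal illustration only; the `predict_state` function, the toy model, and the metric name are assumptions for exposition and are not the actual CheckDST API.

```python
# Minimal sketch of a perturbation-consistency diagnostic (illustrative only;
# not the actual CheckDST implementation). Assumes `predict_state` maps a
# dialogue turn to a set of (slot, value) pairs.

from typing import Callable, List, Set, Tuple

SlotValue = Tuple[str, str]


def consistency_rate(
    predict_state: Callable[[str], Set[SlotValue]],
    original_turns: List[str],
    perturbed_turns: List[str],
) -> float:
    """Fraction of examples whose predicted state is unchanged after perturbation."""
    assert len(original_turns) == len(perturbed_turns)
    consistent = sum(
        predict_state(orig) == predict_state(pert)
        for orig, pert in zip(original_turns, perturbed_turns)
    )
    return consistent / max(len(original_turns), 1)


if __name__ == "__main__":
    # Placeholder keyword-matching "model" used only to demonstrate the metric.
    def toy_model(turn: str) -> Set[SlotValue]:
        return {("hotel-area", "centre")} if "centre" in turn.lower() else set()

    orig = ["I need a hotel in the centre.", "Book a taxi to the airport."]
    pert = ["I'm looking for a place to stay in the town centre.",
            "Get me a cab to the airport."]
    print(f"Consistency: {consistency_rate(toy_model, orig, pert):.2f}")
```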
