Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfluencies are an under-studied topic in NLP, even though they are ubiquitous in human conversation. This is largely due to a lack of datasets containing disfluencies. In this paper, we present a new challenge question answering dataset, DISFL-QA, a derivative of SQuAD, in which humans introduce contextual disfluencies into previously fluent questions. DISFL-QA contains a variety of challenging disfluencies that require a more comprehensive understanding of the text than was necessary in prior datasets. Experiments show that the performance of existing state-of-the-art question answering models degrades significantly when they are tested on DISFL-QA in a zero-shot setting. We show that data augmentation methods partially recover the loss in performance, and we also demonstrate the efficacy of fine-tuning on gold data. We argue that large-scale disfluency datasets are needed for NLP models to become robust to disfluencies. The dataset is publicly available at: https://github.com/google-research-datasets/disfl-qa.
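The zero-shot setup described above can be sketched in a few lines: keep the SQuAD contexts and answers fixed, and substitute each fluent question with its disfluent rewrite before running an off-the-shelf QA model. The example questions, ids, and the pairing-by-id layout below are illustrative assumptions, not the dataset's confirmed schema; see the repository for the actual file format.

```python
import json

# Hypothetical fluent questions keyed by a SQuAD-style question id
# (illustrative data, not drawn from DISFL-QA itself).
fluent_questions = {
    "q1": "When was the university founded?",
    "q2": "Who wrote the novel?",
}

# Hypothetical disfluent rewrites of the same questions, with the kinds
# of corrections and restarts the abstract describes.
disfluent_questions = {
    "q1": "When was the college, no sorry, the university founded?",
    "q2": "Who published, I mean, who wrote the novel?",
}

def build_zero_shot_inputs(fluent, disfluent):
    """Swap each fluent question for its disfluent variant, keeping ids
    aligned so the original SQuAD contexts and gold answers still apply.
    Falls back to the fluent question when no rewrite exists."""
    return {qid: disfluent.get(qid, q) for qid, q in fluent.items()}

eval_inputs = build_zero_shot_inputs(fluent_questions, disfluent_questions)
print(json.dumps(eval_inputs, indent=2))
```

Because only the questions change, any existing SQuAD evaluation harness can score the model's predictions on these inputs unmodified, isolating the effect of the disfluencies themselves.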
