Discourse Comprehension: A Question Answering Framework to Represent Sentence Connections

While there has been substantial progress in text comprehension through simple factoid question answering, more holistic comprehension of a discourse still presents a major challenge (Dunietz et al., 2020). Someone critically reflecting on a text as they read it will pose curiosity-driven, often open-ended questions, which reflect deep understanding of the content and require complex reasoning to answer (Ko et al., 2020; Westera et al., 2020). A key challenge in building and evaluating models for this type of discourse comprehension is the lack of annotated data, especially since finding answers to such questions (which may not be answered at all) imposes a high cognitive load on annotators working over long documents. This paper presents a novel paradigm that enables scalable data collection targeting the comprehension of news documents, viewing these questions through the lens of discourse. The resulting corpus, DCQA (Discourse Comprehension by Question Answering), consists of 22,394 question-answer pairs across 606 English documents. DCQA captures both discourse and semantic links between sentences in the form of free-form, open-ended questions. On an evaluation set that we annotated using questions from Ko et al. (2020), we show that DCQA provides valuable supervision for answering open-ended questions. We additionally design pre-training methods utilizing existing question-answering resources, and use synthetic data to accommodate unanswerable questions. We release DCQA

[1] Shuyang Cao, et al. Controllable Open-ended Question Generation with A New Question Type Ontology, 2021, ACL.

[2] Aurko Roy, et al. Hurdles to Progress in Long-form Question Answering, 2021, NAACL.

[3] Nicola De Cao, et al. KILT: a Benchmark for Knowledge Intensive Language Tasks, 2020, NAACL.

[4] Marcel Worring, et al. NLQuAD: A Non-Factoid Long Question Answering Data Set, 2021, EACL.

[5] Reut Tsarfaty, et al. QADiscourse - Discourse Relations as QA Pairs: Representation, Crowdsourcing and Baselines, 2020, EMNLP.

[6] Junyi Jessy Li, et al. Inquisitive Question Generation for High Level Text Comprehension, 2020, EMNLP.

[7] Aaron Lee, et al. Discourse as a Function of Event: Profiling Discourse Structure in News Articles around the Main Event, 2020, ACL.

[8] Jennifer Chu-Carroll, et al. To Test Machine Comprehension, Start by Defining Comprehension, 2020, ACL.

[9] Hannah Rohde, et al. TED-Q: TED Talks and the Questions they Evoke, 2020, LREC.

[10] Arman Cohan, et al. Longformer: The Long-Document Transformer, 2020, ArXiv.

[11] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.

[12] Iryna Gurevych, et al. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, 2019, EMNLP.

[13] Ming-Wei Chang, et al. Natural Questions: A Benchmark for Question Answering Research, 2019, TACL.

[14] Jason Weston, et al. ELI5: Long Form Question Answering, 2019, ACL.

[15] Arndt Riester, et al. Constructing QUD Trees, 2019, Questions in Discourse.

[16] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[17] Yoshua Bengio, et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, 2018, EMNLP.

[18] Eunsol Choi, et al. QuAC: Question Answering in Context, 2018, EMNLP.

[19] Percy Liang, et al. Know What You Don’t Know: Unanswerable Questions for SQuAD, 2018, ACL.

[20] Nils Reiter, et al. QUD-Based Annotation of Discourse Structure and Information Structure: Tool and Evaluation, 2018, LREC.

[21] Luke S. Zettlemoyer, et al. AllenNLP: A Deep Semantic Natural Language Processing Platform, 2018, ArXiv.

[22] David I. Beaver, et al. Question-based Models of Information Structure, 2016.

[23] Chris Callison-Burch, et al. Problems in Current Text Simplification Research: New Data Can Help, 2015, TACL.

[24] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[25] Livio Robaldo, et al. The Penn Discourse TreeBank 2.0, 2008, LREC.

[26] Alex Lascarides, et al. Segmented Discourse Representation Theory: Dynamic Semantics With Discourse Structure, 2008.

[27] Rashmi Prasad. A Discourse-based Approach to Generating Why-Questions from Texts, 2008.

[28] Edward Gibson, et al. Representing Discourse Coherence: A Corpus-Based Study, 2005, CL.

[29] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[30] A. Graesser, et al. Mechanisms that generate questions, 1992.

[31] D. K. Davis. News as Discourse, 1989.

[32] William C. Mann, et al. Rhetorical Structure Theory: Toward a functional theory of text organization, 1988.

[33] J. Hobbs. On the coherence and structure of discourse, 1985.

[34] R. Brookshire, et al. Comprehension of main ideas and details in coherent and noncoherent discourse by aphasic and nonaphasic listeners, 1984, Brain and Language.