On Generating Extended Summaries of Long Documents

Prior work in document summarization has mainly focused on generating short summaries. While such a summary provides a high-level view of a document, in some cases it is desirable to know more about the document's salient points than can fit in a short summary. This is typically the case for longer documents such as research papers, legal documents, or books. In this paper, we present a new method for generating extended summaries of long papers. Our method exploits the hierarchical structure of a document and incorporates it into an extractive summarization model through a multi-task learning approach. We then report results on three long-document summarization datasets: arXiv-Long, PubMed-Long, and Longsumm. Our method outperforms or matches the performance of strong baselines. Furthermore, we perform a comprehensive analysis of the generated summaries, offering insights for future research on long-form summary generation. Our analysis shows that our multi-task approach can shift the extraction probability distribution in favor of summary-worthy sentences across diverse sections. Our datasets and code are publicly available at https://github.com/Georgetown-IR-Lab/ExtendedSumm.
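To illustrate the multi-task setup described above, the sketch below combines a sentence-extraction objective with an auxiliary section-classification objective into a single training loss. The function name, the use of NumPy, and the fixed weighting factor `alpha` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def multitask_loss(ext_logits, ext_labels, sec_logits, sec_labels, alpha=0.5):
    """Joint loss for a multi-task extractive summarizer (illustrative sketch):
    binary cross-entropy for the sentence-extraction head plus categorical
    cross-entropy for predicting which section each sentence belongs to."""
    # Extraction head: sigmoid over per-sentence logits, then binary cross-entropy
    # against 0/1 labels marking summary-worthy sentences.
    p = 1.0 / (1.0 + np.exp(-ext_logits))
    bce = -np.mean(ext_labels * np.log(p) + (1.0 - ext_labels) * np.log(1.0 - p))
    # Section head: numerically stable softmax over section logits, then
    # cross-entropy against each sentence's gold section index.
    z = sec_logits - sec_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    ce = -np.mean(np.log(probs[np.arange(len(sec_labels)), sec_labels]))
    # Weighted combination; gradients from both tasks shape the shared encoder.
    return alpha * bce + (1.0 - alpha) * ce
```

In this toy view, the auxiliary section-prediction task encourages the shared sentence representations to encode document structure, which is one plausible reading of how hierarchical structure could redistribute extraction probability across sections.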
