Long-Span Dependencies in Transformer-based Summarization Systems

Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks, including document summarization. Typically, these systems are built by fine-tuning a large pre-trained model on the target task. One limitation of these models is that their memory and compute requirements scale poorly with input length, since standard self-attention is quadratic in the sequence length. This makes them challenging to train or fine-tune for long-document summarization. In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization using two methods: local self-attention and explicit content selection. These approaches are compared across a range of network configurations. Experiments are carried out on standard long-span summarization tasks: the Spotify Podcast, arXiv, and PubMed datasets. We demonstrate that by combining these methods we achieve state-of-the-art ROUGE results on all three tasks. Moreover, our approach achieves comparable or better results than existing approaches without requiring a large-memory GPU.
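To make the first of the two methods concrete, the following is a minimal sketch of local (windowed) self-attention, in which each token attends only to its neighbours within a fixed window. It is an illustration under assumed names and shapes, not the authors' implementation (which builds on a large pre-trained encoder-decoder), and it materializes the full attention matrix for clarity rather than exploiting the sparsity for memory savings.

# Minimal sketch of single-head local (windowed) self-attention.
# Tensor names, shapes, and the window size are illustrative assumptions.
import torch
import torch.nn.functional as F

def local_self_attention(q, k, v, window: int):
    """q, k, v: [batch, seq_len, dim]; each position attends only to
    positions within +/- window of itself."""
    batch, seq_len, dim = q.shape
    scores = q @ k.transpose(-2, -1) / dim ** 0.5            # [batch, seq, seq]
    idx = torch.arange(seq_len, device=q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window     # local band mask
    scores = scores.masked_fill(~band, float("-inf"))        # block long-range links
    return F.softmax(scores, dim=-1) @ v                     # [batch, seq, dim]

# Example: one sequence of 8 tokens, model dim 4, window of 2 tokens per side
q = k = v = torch.randn(1, 8, 4)
out = local_self_attention(q, k, v, window=2)
print(out.shape)  # torch.Size([1, 8, 4])

In an efficient implementation the masked entries are never computed, so memory grows linearly with sequence length for a fixed window rather than quadratically; that saving is what makes fine-tuning on long documents tractable.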
