Does Pretraining for Summarization Require Knowledge Transfer?

Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining’s benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that by pretraining on documents consisting of character n-grams selected at random, we can nearly match the performance of models pretrained on real corpora. This work holds the promise of eliminating upstream corpora, which may alleviate some concerns over offensive language, bias, and copyright issues. To see whether the small residual benefit of using real data could be accounted for by the structure of the pretraining task, we design several tasks motivated by a qualitative study of summarization corpora. However, these tasks confer no appreciable benefit, leaving open the possibility of a small role for knowledge transfer.
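For concreteness, the sketch below shows one way a synthetic pretraining corpus of this kind could be assembled: documents built by stringing together character n-grams drawn uniformly at random. The alphabet, n-gram length range, vocabulary size, and document length here are illustrative assumptions, not the settings used in the paper.

```python
import random
import string

# Illustrative sketch only: the paper pretrains on documents made of randomly
# selected character n-grams. The alphabet, n-gram lengths, vocabulary size,
# and document size below are placeholder assumptions, not the authors' setup.

def random_ngram(min_n=2, max_n=5, alphabet=string.ascii_lowercase):
    """Sample one 'word': a character n-gram drawn uniformly at random."""
    n = random.randint(min_n, max_n)
    return "".join(random.choice(alphabet) for _ in range(n))

def random_document(num_tokens=200, vocab=None, vocab_size=5000):
    """Build a nonsense document by sampling tokens from a random n-gram vocabulary."""
    if vocab is None:
        vocab = [random_ngram() for _ in range(vocab_size)]
    return " ".join(random.choice(vocab) for _ in range(num_tokens))

if __name__ == "__main__":
    # A tiny synthetic "corpus" of three nonsense documents.
    shared_vocab = [random_ngram() for _ in range(5000)]
    corpus = [random_document(vocab=shared_vocab) for _ in range(3)]
    for doc in corpus:
        print(doc[:80], "...")
```

A summarization model would then be pretrained on such nonsense documents before fine-tuning on a real summarization corpus, isolating the contribution of the pretraining task itself from any knowledge contained in real text.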
