TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts

Recent neural summarization models contain millions of parameters, and their performance depends heavily on the abundance of training data. While most existing summarization corpora contain on the order of thousands to one million instances, the construction of large-scale summarization datasets with several million instances remains largely unexplored. In practice, more data helps models generalize learned patterns to unseen data. In this paper, we introduce TLDR9+, a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum (https://www.reddit.com). This dataset is gathered specifically for extreme summarization (i.e., generating a one-sentence summary with high compression and abstraction) and is more than twice as large as the previously proposed dataset. We go one step further and, with the help of human annotations, distill a more fine-grained dataset by sampling high-quality instances from TLDR9+, which we call TLDRHQ. We further benchmark several state-of-the-art summarization models on our proposed datasets.
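The core of this kind of collection is splitting each Reddit submission into its body and the author-written "TL;DR" summary. The following is a minimal sketch of that idea, assuming Pushshift-style submission records with a `selftext` field; the regex, field name, and length thresholds are illustrative assumptions, not the exact heuristics used to build TLDR9+.

```python
import re

# Matches common spelling variants of the "TL;DR" marker (tl;dr, tldr, tl dr, ...).
# Pattern and thresholds below are assumptions for illustration only.
TLDR_PATTERN = re.compile(r"\btl\s*;?\s*dr\s*[:\-]*\s*", re.IGNORECASE)

def extract_pair(selftext: str, min_post_words: int = 30, min_summary_words: int = 3):
    """Split a submission body into (post, tldr_summary), or return None if no usable pair."""
    match = TLDR_PATTERN.search(selftext)
    if match is None:
        return None
    post = selftext[:match.start()].strip()
    summary = selftext[match.end():].strip()
    if len(post.split()) < min_post_words or len(summary.split()) < min_summary_words:
        return None  # discard degenerate pairs (empty post or trivial summary)
    return post, summary

# Usage on a toy Pushshift-style record (threshold lowered so the short toy post passes).
record = {"selftext": "I spent three hours debugging a config typo ... TL;DR: always lint your YAML."}
pair = extract_pair(record["selftext"], min_post_words=5)
if pair is not None:
    post, summary = pair
```

A production pipeline would additionally deduplicate posts, filter deleted or bot-generated content, and normalize markup before splitting; the sketch only shows the pairing step.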
