A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.
