Auto-hMDS: Automatic Construction of a Large Heterogeneous Multilingual Multi-Document Summarization Corpus

Automatic text summarization is a challenging natural language processing (NLP) task that has been researched for several decades. The available datasets for multi-document summarization (MDS), however, are rather small and usually focused on the newswire genre. Machine learning methods are nowadays applied to more and more NLP problems, such as machine translation, question answering, and single-document summarization. Modern machine learning methods, such as neural networks, require large training datasets, which are available for these three tasks but not yet for MDS. This lack of training data limits the development of machine learning methods for MDS. In this work, we automatically generate a large heterogeneous multilingual multi-document summarization corpus. The key idea is to use Wikipedia articles as summaries and to automatically search for appropriate source documents. We created a corpus with 7,316 topics in English and German, with varying summary lengths and varying numbers of source documents. More information about the corpus can be found at the corpus GitHub page at https://github.com/AIPHES/auto-hMDS.
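
To make the construction idea concrete, the following is a minimal Python sketch, not the authors' actual pipeline: it treats a Wikipedia article as the reference summary for a topic and retrieves candidate source documents by searching the web for the article's title. The `wikipedia` package and the `search_web` placeholder are assumptions for illustration only; the paper's query generation and document filtering steps are not reproduced here.

```python
# Sketch of the corpus-construction idea: a Wikipedia article serves as
# the (abstractive) reference summary, and a web search for the article's
# title retrieves candidate source documents for that topic.

import wikipedia  # pip install wikipedia (assumed third-party package)


def search_web(query: str, num_results: int = 10) -> list[str]:
    """Placeholder for a search-engine API call (e.g. via an API key).
    Returns a list of result URLs; plug in a real backend here."""
    raise NotImplementedError("connect a real search API")


def build_topic(title: str, language: str = "en") -> dict:
    """Build one corpus topic: the Wikipedia article text is the summary;
    search hits for the article title are candidate source documents."""
    wikipedia.set_lang(language)
    page = wikipedia.page(title)
    candidate_urls = search_web(page.title)
    # Exclude Wikipedia itself so the summary never appears among
    # its own source documents.
    candidate_urls = [u for u in candidate_urls if "wikipedia.org" not in u]
    return {
        "topic": page.title,
        "summary": page.content,        # reference summary
        "source_urls": candidate_urls,  # documents to be summarized
    }
```

Because Wikipedia spans many domains and languages, repeating this per-topic procedure over English and German articles yields a heterogeneous, multilingual corpus by construction.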
