Auto-hMDS: Automatic Construction of a Large Heterogeneous Multilingual Multi-Document Summarization Corpus

Automatic text summarization is a challenging natural language processing (NLP) task that has been researched for several decades. The available datasets for multi-document summarization (MDS), however, are rather small and usually focused on the newswire genre. Machine learning methods are nowadays applied to more and more NLP problems, such as machine translation, question answering, and single-document summarization. Modern machine learning methods, such as neural networks, require large training datasets, which are available for these three tasks but not yet for MDS. This lack of training data limits the development of machine learning methods for MDS. In this work, we automatically generate a large heterogeneous multilingual multi-document summarization corpus. The key idea is to use Wikipedia articles as summaries and to automatically search for appropriate source documents. We created a corpus with 7,316 topics in English and German, with varying summary lengths and varying numbers of source documents. More information about the corpus can be found at the corpus GitHub page at https://github.com/AIPHES/auto-hMDS.
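
To make the construction idea concrete, the following is a minimal Python sketch, not the authors' actual pipeline: it treats a Wikipedia article as the reference summary for a topic and retrieves candidate source documents by searching the web for the article's title. The `wikipedia` package and the `search_web` placeholder are assumptions for illustration only; the paper's query generation and document filtering steps are not reproduced here.

```python
# Sketch of the corpus-construction idea: a Wikipedia article serves as
# the (abstractive) reference summary, and a web search for the article's
# title retrieves candidate source documents for that topic.

import wikipedia  # pip install wikipedia (assumed third-party package)


def search_web(query: str, num_results: int = 10) -> list[str]:
    """Placeholder for a search-engine API call (e.g. via an API key).
    Returns a list of result URLs; plug in a real backend here."""
    raise NotImplementedError("connect a real search API")


def build_topic(title: str, language: str = "en") -> dict:
    """Build one corpus topic: the Wikipedia article text is the summary;
    search hits for the article title are candidate source documents."""
    wikipedia.set_lang(language)
    page = wikipedia.page(title)
    candidate_urls = search_web(page.title)
    # Exclude Wikipedia itself so the summary never appears among
    # its own source documents.
    candidate_urls = [u for u in candidate_urls if "wikipedia.org" not in u]
    return {
        "topic": page.title,
        "summary": page.content,        # reference summary
        "source_urls": candidate_urls,  # documents to be summarized
    }
```

Because Wikipedia spans many domains and languages, repeating this per-topic procedure over English and German articles yields a heterogeneous, multilingual corpus by construction.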
