Artemis: A Novel Annotation Methodology for Indicative Single Document Summarization

We describe Artemis (Annotation methodology for Rich, Tractable, Extractive, Multi-domain, Indicative Summarization), a novel hierarchical annotation process that produces indicative summaries for documents from multiple domains. Existing summarization evaluation datasets are typically single-domain, drawn from the few domains in which naturally occurring summaries are easy to find, such as news and scientific articles. They are therefore not sufficient for training and evaluating summarization models used in document management and information retrieval systems, which must handle documents from many domains. Compared to other annotation methods such as Relative Utility and Pyramid, Artemis is more tractable because judges do not need to read every sentence in a document when making an importance judgment for a single sentence, while still providing similarly rich sentence-importance annotations. We describe the annotation process in detail, compare it with other similar evaluation systems, and present analysis and experimental results over a sample set of 532 annotated documents.
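
As a rough illustration of the tractability argument only (this is not the paper's actual protocol), the sketch below contrasts the reading cost of a whole-document judging scheme such as Relative Utility, in which every sentence-level judgment requires scanning all n sentences, with a hierarchical scheme in which each sentence is judged only against its local section. The function names, section sizes, and cost model are assumptions made here for illustration.

```python
# Illustrative cost model for annotator reading effort (assumptions, not the Artemis spec).
# whole_document_cost: each of the n sentence judgments requires reading all n sentences.
# hierarchical_cost: each sentence is judged only within its own section, so each
# judgment touches roughly section_size sentences instead of the full document.

def whole_document_cost(n_sentences: int) -> int:
    """Sentences read if every importance judgment considers the full document."""
    return n_sentences * n_sentences

def hierarchical_cost(section_sizes: list[int]) -> int:
    """Sentences read if each sentence is judged only against its own section."""
    return sum(size * size for size in section_sizes)

if __name__ == "__main__":
    # Hypothetical 60-sentence document split into six 10-sentence sections.
    sections = [10] * 6
    n = sum(sections)
    print("whole-document scheme:", whole_document_cost(n))      # 3600 sentence reads
    print("hierarchical scheme:  ", hierarchical_cost(sections))  # 600 sentence reads
```

Under these assumed numbers, the hierarchical judging scheme cuts the per-annotator reading burden by a factor of six while still yielding an importance judgment for every sentence.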
