Facet-Aware Evaluation for Extractive Summarization

Commonly adopted metrics for extractive summarization focus on lexical overlap at the token level. In this paper, we present a facet-aware evaluation setup for better assessment of the information coverage in extracted summaries. Specifically, we treat each sentence in the reference summary as a facet, identify the document sentences that express the semantics of each facet as its support sentences, and automatically evaluate extractive summarization methods by comparing the indices of the extracted sentences with the support sentences of all facets in the reference summary. To facilitate this new evaluation setup, we construct an extractive version of the CNN/Daily Mail dataset and perform a thorough quantitative investigation, which demonstrates that facet-aware evaluation correlates better with human judgment than ROUGE, enables fine-grained evaluation as well as comparative analysis, and reveals valuable insights into state-of-the-art summarization methods. Data can be found at this https URL.
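The evaluation setup described above can be sketched as a simple index comparison: a facet counts as covered if the extraction includes at least one of its support sentences. This is an illustrative sketch only; the function and variable names are hypothetical and not taken from the paper's released code or data format.

```python
def facet_aware_score(extracted, facets):
    """Fraction of facets covered by an extractive summary.

    extracted: iterable of extracted sentence indices (into the document).
    facets: list of sets; each set holds the support-sentence indices
            for one facet (one sentence of the reference summary).
    A facet is covered if any of its support sentences was extracted.
    """
    extracted = set(extracted)
    if not facets:
        return 0.0
    covered = sum(1 for support in facets if extracted & support)
    return covered / len(facets)

# Example: three facets; extracting sentences {0, 4} covers the first
# facet (via sentence 0) and the third (via sentence 4), but not the second.
print(facet_aware_score({0, 4}, [{0, 1}, {2}, {4, 5}]))  # 0.666...
```

Because the score operates on sentence indices rather than token overlap, two different extractions that cover the same facets receive the same score, which is the property that distinguishes this setup from ROUGE-style lexical matching.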
