Re-evaluating Evaluation in Text Summarization

Automated evaluation metrics, used as a stand-in for manual evaluation, are an essential part of developing text-generation systems such as text summarizers. However, while the field has progressed, our standard metrics have not: for nearly 20 years, ROUGE has been the standard evaluation metric in most summarization papers. In this paper, we re-evaluate the evaluation method for text summarization by assessing the reliability of automatic metrics on top-scoring system outputs, both abstractive and extractive, on recently popular datasets, in both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics drawn on older datasets do not necessarily hold on modern datasets and systems.
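To make the two evaluation settings concrete, the following is a minimal sketch of how system-level and summary-level correlation between an automatic metric and human judgments are commonly computed. It is an illustration under stated assumptions, not the paper's implementation: the function names (pearson, system_level, summary_level), the choice of Pearson correlation, and the toy scores are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def column_means(scores):
    """Average each system's scores over all documents.

    scores[i][j] is the score of system j's summary for document i.
    """
    n_docs = len(scores)
    n_systems = len(scores[0])
    return [sum(scores[i][j] for i in range(n_docs)) / n_docs
            for j in range(n_systems)]


def system_level(metric, human):
    """Correlate per-system averages: one point per system."""
    return pearson(column_means(metric), column_means(human))


def summary_level(metric, human):
    """Correlate across systems within each document, then average."""
    per_doc = [pearson(m_row, h_row) for m_row, h_row in zip(metric, human)]
    return sum(per_doc) / len(per_doc)


# Hypothetical toy scores: 3 documents (rows) x 3 systems (columns).
metric_scores = [
    [0.30, 0.40, 0.50],
    [0.50, 0.40, 0.45],
    [0.20, 0.35, 0.30],
]
human_scores = [
    [0.2, 0.5, 0.6],
    [0.3, 0.6, 0.7],
    [0.1, 0.4, 0.8],
]

print("system-level: ", system_level(metric_scores, human_scores))
print("summary-level:", summary_level(metric_scores, human_scores))
```

The system-level setting asks whether a metric ranks whole systems the way humans do, while the summary-level setting asks whether it ranks individual summaries of the same document the way humans do. A metric can do well in one setting and poorly in the other, which is why the two are reported separately. A rank correlation such as Kendall's tau could be substituted for Pearson without changing the structure of the sketch.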

[1]  Ani Nenkova,et al.  Evaluating Content Selection in Summarization: The Pyramid Method , 2004, NAACL.

[2]  Pengfei Liu,et al.  Heterogeneous Graph Neural Networks for Extractive Document Summarization , 2020, ACL.

[3]  Ondrej Bojar,et al.  Results of the WMT16 Metrics Shared Task , 2016 .

[4]  Maria Leonor Pacheco,et al.  of the Association for Computational Linguistics: , 2001 .

[5]  Deyu Zhou,et al.  Neural Storyline Extraction Model for Storyline Generation from News Articles , 2018, NAACL.

[6]  J. Rodgers,et al.  Thirteen ways to look at the correlation coefficient , 1988 .

[7]  Hoa Trang Dang,et al.  Overview of DUC 2005 , 2005 .

[8]  Pengfei Liu,et al.  Extractive Summarization as Text Matching , 2020, ACL.

[9]  Ming Zhou,et al.  HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization , 2019, ACL.

[10]  N. Draper,et al.  Applied Regression Analysis , 1966 .

[11]  Noah A. Smith,et al.  Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts , 2019, ACL.

[12]  Jianfeng Gao,et al.  An Information-Theoretic Approach to Automatic Evaluation of Summaries , 2006, NAACL.

[13]  Éric Gaussier,et al.  A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation , 2005, ECIR.

[14]  Alexander M. Rush,et al.  Bottom-Up Abstractive Summarization , 2018, EMNLP.

[15]  Omer Levy,et al.  BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.

[16]  Tiejun Zhao,et al.  Neural Document Summarization by Jointly Learning to Score and Select Sentences , 2018, ACL.

[17]  Joel Nothman,et al.  SciPy 1.0-Fundamental Algorithms for Scientific Computing in Python , 2019, ArXiv.

[18]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[19]  Kilian Q. Weinberger,et al.  BERTScore: Evaluating Text Generation with BERT , 2019, ICLR.

[20]  John M. Conroy,et al.  A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art , 2013, ACL.

[21]  Yen-Chun Chen,et al.  Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting , 2018, ACL.

[22]  Xuanjing Huang,et al.  Searching for Effective Neural Extractive Summarization: What Works and What’s Next , 2019, ACL.

[23]  Xiaodong Liu,et al.  Unified Language Model Pre-training for Natural Language Understanding and Generation , 2019, NeurIPS.

[24]  Jun-Ping Ng,et al.  Better Summarization Evaluation with Word Embeddings for ROUGE , 2015, EMNLP.

[25]  Mirella Lapata,et al.  Text Summarization with Pretrained Encoders , 2019, EMNLP.

[26]  Matt J. Kusner,et al.  From Word Embeddings To Document Distances , 2015, ICML.

[27]  Hoa Trang Dang,et al.  Overview of DUC 2006 , 2006 .

[28]  Percy Liang,et al.  The price of debiasing automatic metrics in natural language evalaution , 2018, ACL.

[29]  Rotem Dror,et al.  The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing , 2018, ACL.

[30]  S. Lewis,et al.  Regression analysis , 2007, Practical Neurology.

[31]  Philipp Koehn,et al.  Statistical Significance Tests for Machine Translation Evaluation , 2004, EMNLP.

[32]  Colin Raffel,et al.  Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..

[33]  Yvette Graham,et al.  Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE , 2015, EMNLP.

[34]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[35]  Phil Blunsom,et al.  Teaching Machines to Read and Comprehend , 2015, NIPS.

[36]  Mirella Lapata,et al.  Ranking Sentences for Extractive Summarization with Reinforcement Learning , 2018, NAACL.

[37]  Kathleen McKeown,et al.  Content Selection in Deep Learning Models of Summarization , 2018, EMNLP.

[38]  Verena Rieser,et al.  Why We Need New Evaluation Metrics for NLG , 2017, EMNLP.

[39]  Ani Nenkova,et al.  Automatically Assessing Machine Summary Content Without a Gold Standard , 2013, CL.

[40]  Iryna Gurevych,et al.  Learning to Score System Summaries for Better Content Selection Evaluation. , 2017, NFiS@EMNLP.

[41]  Angela Fan,et al.  Controllable Abstractive Summarization , 2017, NMT@ACL.

[42]  Mirella Lapata,et al.  Neural Latent Extractive Document Summarization , 2018, EMNLP.

[43]  Jaewoo Kang,et al.  Learning by Semantic Similarity Makes Abstractive Summarization Better , 2020, ArXiv.

[44]  Jackie Chi Kit Cheung,et al.  BanditSum: Extractive Summarization as a Contextual Bandit , 2018, EMNLP.

[45]  Bowen Zhou,et al.  Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond , 2016, CoNLL.

[46]  Fei Liu,et al.  MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance , 2019, EMNLP.

[47]  Ido Dagan,et al.  Better Rewards Yield Better Summaries: Learning to Summarise Without References , 2019, EMNLP.

[48]  Thibault Sellam,et al.  BLEURT: Learning Robust Metrics for Text Generation , 2020, ACL.

[49]  Klaus Krippendorff,et al.  Computing Krippendorff's Alpha-Reliability , 2011 .

[50]  Christopher D. Manning,et al.  Get To The Point: Summarization with Pointer-Generator Networks , 2017, ACL.

[51]  Ji Wang,et al.  Pretraining-Based Natural Language Generation for Text Summarization , 2019, CoNLL.

[52]  Hoa Trang Dang,et al.  Overview of the TAC 2008 Update Summarization Task , 2008, TAC.

[53]  Timothy Baldwin,et al.  Testing for Significance of Increased Correlation with Human Judgment , 2014, EMNLP.

[54]  Ido Dagan,et al.  Crowdsourcing Lightweight Pyramids for Manual Summary Evaluation , 2019, NAACL.

[55]  Maxime Peyrard,et al.  Studying Summarization Evaluation Metrics in the Appropriate Scoring Range , 2019, ACL.

[56]  Chin-Yew Lin,et al.  ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation , 2004, COLING.