A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods

Abstract: The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, or whether a difference between two metrics' correlations reflects a true difference or is merely due to chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in the reliability of automatic metrics. Further, although many metrics fail to show statistically significant improvements over ROUGE, two recent metrics, QAEval and BERTScore, do so in some evaluation settings.
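To make the two resampling procedures in the abstract concrete, the sketch below shows how a bootstrap confidence interval for a metric's correlation with human judgments, and a permutation test for the difference between two metrics' correlations, could be implemented. It is a minimal illustration, not the paper's exact experimental setup: it assumes per-summary metric and human scores stored as NumPy arrays and uses Kendall's tau as the correlation coefficient; all function and variable names are hypothetical.

```python
# Minimal sketch of resampling-based analysis of metric-human correlations.
# Assumes one score per summary for each metric and for the human annotations.
import numpy as np
from scipy.stats import kendalltau


def bootstrap_ci(metric_scores, human_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a metric's correlation with humans."""
    rng = np.random.default_rng(seed)
    n = len(human_scores)
    correlations = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample summaries with replacement
        tau, _ = kendalltau(metric_scores[idx], human_scores[idx])
        correlations.append(tau)
    lower, upper = np.percentile(correlations, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


def permutation_test(metric_a, metric_b, human_scores, n_resamples=1000, seed=0):
    """Permutation test of H0: metric A and metric B correlate equally with humans.

    Under H0 the two metrics' scores are exchangeable, so we randomly swap them
    per summary and recompute the difference in correlations.
    """
    rng = np.random.default_rng(seed)
    observed = kendalltau(metric_a, human_scores)[0] - kendalltau(metric_b, human_scores)[0]
    count = 0
    for _ in range(n_resamples):
        swap = rng.random(len(human_scores)) < 0.5          # which summaries to swap
        a = np.where(swap, metric_b, metric_a)
        b = np.where(swap, metric_a, metric_b)
        delta = kendalltau(a, human_scores)[0] - kendalltau(b, human_scores)[0]
        if delta >= observed:
            count += 1
    return count / n_resamples  # one-sided p-value for "A correlates better than B"
```

With this kind of setup, a wide interval returned by `bootstrap_ci` corresponds to the high uncertainty reported in the paper, and a small p-value from `permutation_test` corresponds to a statistically significant improvement of one metric over another.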
