Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE

We provide an analysis of current evaluation methodologies applied to summarization metrics and identify the following areas of concern: (1) a movement away from evaluation by correlation with human assessment; (2) the omission of important components of human assessment from evaluations, combined with the use of large numbers of metric variants; (3) the absence of methods for significance testing of improvements over a baseline. We outline an evaluation methodology that overcomes all of these challenges, providing the first method of significance testing suitable for the evaluation of summarization metrics. Our evaluation reveals, for the first time, which metric variants significantly outperform others, identifies optimal metric variants distinct from the currently recommended best variants, and shows the machine translation metric BLEU to perform on par with ROUGE for the purpose of evaluating summarization systems. We subsequently replicate a recent large-scale evaluation that relied on what we now know to be suboptimal ROUGE variants, revealing distinct conclusions about the relative performance of state-of-the-art summarization systems.
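One established way to test whether one metric's correlation with human assessment is significantly higher than another's, when the two metrics are themselves correlated, is Williams' test for a difference between dependent correlations (Steiger, 1980). The following is a minimal illustrative sketch under that assumption, not the paper's released evaluation code; the variable names (human, metric_a, metric_b) and the synthetic system-level scores are purely hypothetical.

import numpy as np
from scipy.stats import pearsonr, t as t_dist

def williams_test(r12, r13, r23, n):
    # One-sided Williams' test that r12 (metric A vs. human) exceeds
    # r13 (metric B vs. human), given the inter-metric correlation r23
    # and n data points (Steiger, 1980).
    det_r = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23  # determinant of the 3x3 correlation matrix
    r_bar = (r12 + r13) / 2
    t_stat = (r12 - r13) * np.sqrt(
        (n - 1) * (1 + r23)
        / (2 * det_r * (n - 1) / (n - 3) + r_bar**2 * (1 - r23) ** 3)
    )
    p_value = 1 - t_dist.cdf(t_stat, df=n - 3)  # one-sided p-value
    return t_stat, p_value

# Hypothetical usage: synthetic scores stand in for real system-level data.
rng = np.random.default_rng(0)
human = rng.normal(size=50)
metric_a = human + rng.normal(scale=0.5, size=50)
metric_b = human + rng.normal(scale=0.8, size=50)
r12 = pearsonr(metric_a, human)[0]
r13 = pearsonr(metric_b, human)[0]
r23 = pearsonr(metric_a, metric_b)[0]
print(williams_test(r12, r13, r23, n=len(human)))

The key design point is that the test accounts for r23, the correlation between the two competing metrics: ignoring it (for example, by bootstrapping each correlation independently) tends to understate the significance of a difference between metrics that score the same systems.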
