Estimating Summary Quality with Pairwise Preferences

Automatic evaluation of summarization has relied on gold standard summaries for over ten years. Gold standard summaries are expensive to obtain and often require domain experts to reach high quality. In this paper, we propose an alternative evaluation approach based on pairwise preferences between sentences, which are simpler and cheaper to collect than gold standard summaries. Our experiments show that humans can provide useful feedback in the form of pairwise preferences. The new framework outperforms the three most popular variants of ROUGE while requiring less expensive human input. We also show that the framework can reuse already available evaluation data and achieve even better results.
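As a rough illustration of how pairwise sentence preferences might be turned into per-sentence importance scores, the sketch below fits a standard Bradley-Terry model with the classic iterative (Zermelo-style) update. This is a minimal sketch, not the paper's actual procedure: the function name, data format, and the choice of Bradley-Terry aggregation are assumptions made here for illustration.

```python
# Minimal sketch: aggregating pairwise sentence preferences into importance
# scores with a Bradley-Terry model (hypothetical data format; the paper's
# actual scoring procedure may differ).
from collections import defaultdict

def bradley_terry_scores(preferences, n_sentences, iterations=100, tol=1e-6):
    """preferences: list of (winner, loser) sentence-index pairs from annotators."""
    wins = defaultdict(int)          # wins[i] = comparisons sentence i won
    pair_counts = defaultdict(int)   # pair_counts[(i, j)] = times i and j were compared
    for winner, loser in preferences:
        wins[winner] += 1
        pair_counts[tuple(sorted((winner, loser)))] += 1

    scores = [1.0] * n_sentences
    for _ in range(iterations):
        new_scores = []
        for i in range(n_sentences):
            denom = 0.0
            for (a, b), n_ab in pair_counts.items():
                if i in (a, b):
                    j = b if i == a else a
                    denom += n_ab / (scores[i] + scores[j])
            new_scores.append(wins[i] / denom if denom > 0 else scores[i])
        # Normalize so scores sum to 1 (the model is scale-invariant).
        total = sum(new_scores)
        new_scores = [s / total for s in new_scores]
        if max(abs(a - b) for a, b in zip(new_scores, scores)) < tol:
            return new_scores
        scores = new_scores
    return scores

# Usage: sentence 0 preferred over 1 twice, 1 over 2 once, 0 over 2 once.
prefs = [(0, 1), (0, 1), (1, 2), (0, 2)]
print(bradley_terry_scores(prefs, n_sentences=3))
```

The resulting scores could then serve as a reference-free importance signal when comparing system summaries, in place of overlap with a gold standard.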
