Task-based evaluation of text summarization using Relevance Prediction

This article introduces a new task-based evaluation measure, Relevance Prediction, that is a more intuitive measure of an individual's performance on a real-world task than interannotator agreement. Relevance Prediction parallels what a user does in the real-world task of browsing a set of documents with standard search tools: the user judges relevance based on a short summary, and then that same user (not an independent annotator) decides whether to open and judge the corresponding document. We show that this measure is more reliable than LDC Agreement, a gold-standard-based measure currently used in the summarization evaluation community. Our goal is to provide a stable framework within which developers of new automatic measures can make stronger statistical statements about how well their measures predict summary usefulness. As a proof-of-concept methodology for developers of automatic metrics, we demonstrate that a current automatic evaluation measure correlates better with Relevance Prediction than with LDC Agreement, and that detected differences reach significance at a higher level under the former than under the latter.
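To make the contrast between the two measures concrete, here is a minimal sketch of how each could be computed; this is our illustration, not the authors' implementation, and the function names and toy judgments are invented for demonstration.

```python
# Contrast between Relevance Prediction and LDC Agreement (illustrative sketch).

from typing import List


def relevance_prediction(summary_judgments: List[bool],
                         document_judgments: List[bool]) -> float:
    """Fraction of documents on which a judge's summary-based relevance
    decision matches the SAME judge's decision after reading the full
    document (the Relevance Prediction measure described above)."""
    if len(summary_judgments) != len(document_judgments):
        raise ValueError("judgment lists must be aligned per document")
    matches = sum(s == d for s, d in zip(summary_judgments, document_judgments))
    return matches / len(summary_judgments)


def ldc_agreement(summary_judgments: List[bool],
                  gold_judgments: List[bool]) -> float:
    """By contrast, LDC Agreement compares the judge's summary-based
    decisions against an INDEPENDENT annotator's gold-standard labels."""
    matches = sum(s == g for s, g in zip(summary_judgments, gold_judgments))
    return matches / len(summary_judgments)


# One user's decisions over five documents (invented data):
from_summary = [True, False, True, True, False]    # judged from summaries only
from_document = [True, False, False, True, False]  # same user, full documents
gold_labels = [True, True, False, True, False]     # independent LDC annotator

print(f"Relevance Prediction: {relevance_prediction(from_summary, from_document):.2f}")
print(f"LDC Agreement:        {ldc_agreement(from_summary, gold_labels):.2f}")
```

In the framework the abstract describes, per-summarizer Relevance Prediction scores of this kind would then be correlated with an automatic metric's scores to assess how well the metric predicts summary usefulness on the task.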
