Task-based evaluation of text summarization using Relevance Prediction

This article introduces a new task-based evaluation measure, Relevance Prediction, that is a more intuitive measure of an individual's performance on a real-world task than interannotator agreement. Relevance Prediction parallels what a user does in the real-world task of browsing a set of documents with standard search tools: the user judges relevance based on a short summary, and then that same user (not an independent annotator) decides whether to open and judge the corresponding document. We show that this measure is more reliable than LDC Agreement, a gold-standard-based measure currently used in the summarization evaluation community. Our goal is to provide a stable framework within which developers of new automatic measures can make stronger statistical statements about how well their measures predict summary usefulness. As a proof-of-concept methodology for developers of automatic metrics, we demonstrate that a current automatic evaluation measure correlates better with Relevance Prediction than with LDC Agreement, and that detected differences reach significance at a higher level under the former than under the latter.
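To make the contrast between the two measures concrete, here is a minimal sketch of how each could be computed; this is our illustration, not the authors' implementation, and the function names and toy judgments are invented for demonstration.

```python
# Contrast between Relevance Prediction and LDC Agreement (illustrative sketch).

from typing import List


def relevance_prediction(summary_judgments: List[bool],
                         document_judgments: List[bool]) -> float:
    """Fraction of documents on which a judge's summary-based relevance
    decision matches the SAME judge's decision after reading the full
    document (the Relevance Prediction measure described above)."""
    if len(summary_judgments) != len(document_judgments):
        raise ValueError("judgment lists must be aligned per document")
    matches = sum(s == d for s, d in zip(summary_judgments, document_judgments))
    return matches / len(summary_judgments)


def ldc_agreement(summary_judgments: List[bool],
                  gold_judgments: List[bool]) -> float:
    """By contrast, LDC Agreement compares the judge's summary-based
    decisions against an INDEPENDENT annotator's gold-standard labels."""
    matches = sum(s == g for s, g in zip(summary_judgments, gold_judgments))
    return matches / len(summary_judgments)


# One user's decisions over five documents (invented data):
from_summary = [True, False, True, True, False]    # judged from summaries only
from_document = [True, False, False, True, False]  # same user, full documents
gold_labels = [True, True, False, True, False]     # independent LDC annotator

print(f"Relevance Prediction: {relevance_prediction(from_summary, from_document):.2f}")
print(f"LDC Agreement:        {ldc_agreement(from_summary, gold_labels):.2f}")
```

In the framework the abstract describes, per-summarizer Relevance Prediction scores of this kind would then be correlated with an automatic metric's scores to assess how well the metric predicts summary usefulness on the task.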
