The Pyramid Method: Incorporating human content selection variation in summarization evaluation

Human variation in content selection in summarization raises fundamental research questions: How can evaluation measures incorporate the observed variation? How can such measures reflect the fact that summaries conveying different content can be equally good and informative? In this article, we address these questions by proposing a method for analyzing multiple human abstracts into semantic content units. This analysis allows us not only to quantify human variation in content selection but also to assign an empirical importance weight to each content unit. It serves as the basis for an evaluation method, the Pyramid Method, that incorporates the observed variation and predicts that different summaries can be equally informative. We discuss the reliability of content unit annotation, the properties of Pyramid scores, and their correlation with other evaluation methods.
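The weighting and scoring idea can be illustrated with a minimal sketch. It is not the authors' implementation, and it assumes things the abstract does not spell out: that the human annotation step is already done, so each summary is given as a set of content unit labels (the labels `A` to `E` and `X` are hypothetical); that a unit's weight is the number of human abstracts expressing it; and that a summary's score is its total weight divided by the best total achievable with the same number of units. The function names `build_pyramid` and `pyramid_score` are illustrative.

```python
from collections import Counter


def build_pyramid(model_summaries):
    """Map each content unit to the number of human abstracts expressing it."""
    weights = Counter()
    for units in model_summaries:
        # Count a unit at most once per abstract.
        weights.update(set(units))
    return weights


def pyramid_score(peer_units, weights):
    """Ratio of the peer's total weight to the best total for the same unit count."""
    peer = set(peer_units)
    # Units absent from the pyramid contribute a weight of zero.
    observed = sum(weights.get(unit, 0) for unit in peer)
    # Ideal summary: the highest-weighted units, as many as the peer expresses.
    ideal = sum(sorted(weights.values(), reverse=True)[:len(peer)])
    return observed / ideal if ideal else 0.0


# Hypothetical annotation of three human abstracts.
models = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "E"}]
pyramid = build_pyramid(models)

score_ac = pyramid_score({"A", "C"}, pyramid)  # 4 / 5
score_ab = pyramid_score({"A", "B"}, pyramid)  # 5 / 5
score_bx = pyramid_score({"B", "X"}, pyramid)  # 2 / 5
```

In this toy setup, two summaries with different content can receive the same score whenever the units they express carry the same total weight, which is the behavior the abstract describes.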
