Single-document and multi-document summary evaluation using Relative Utility

We present a series of experiments demonstrating the validity of Relative Utility (RU) as a measure for evaluating extractive summarization systems. Like several other evaluation metrics, RU compares sentence selection between machine and reference summarizers. In addition, RU is applicable to both single-document and multi-document summarization, extends to arbitrary compression rates with no extra annotation effort, and accounts for both random system performance and interjudge agreement. RU also offers an option for penalizing summaries that include sentences with redundant information. Our results, based on the JHU summary corpus, indicate that Relative Utility is a reasonable, and often superior, alternative to several common summary evaluation metrics. We also compare RU with other well-known metrics with respect to correlation with human judgments on the DUC corpus.
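The core idea can be sketched as follows: each judge assigns a utility score to every sentence, and a system's summed utility over its extracted sentences is normalized between the expected score of a random extract and the score of the best possible extract. This is a minimal illustrative sketch, not the paper's exact formulation; the function name, input layout, and the specific normalization details are assumptions.

```python
def relative_utility(judge_utils, selected, k):
    """Hedged sketch of a Relative-Utility-style score.

    judge_utils: list of per-judge utility lists, one score per sentence
                 (assumed input format, not from the paper).
    selected:    indices of the sentences the system extracted.
    k:           number of sentences allowed in the summary.
    """
    n = len(judge_utils[0])
    # Total utility each sentence receives across all judges.
    totals = [sum(judge[i] for judge in judge_utils) for i in range(n)]
    system_score = sum(totals[i] for i in selected)
    # Upper bound: the best possible k-sentence extract.
    best_score = sum(sorted(totals, reverse=True)[:k])
    # Lower bound: expected score of a random k-sentence extract.
    random_score = k * sum(totals) / n
    return (system_score - random_score) / (best_score - random_score)
```

A perfect extract scores 1.0 and a random one scores 0.0 in expectation, which is what lets the measure account for both random performance and the ceiling set by the judges.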
