Evaluation Challenges in Large-Scale Document Summarization

We present a large-scale meta evaluation of eight evaluation measures for both single-document and multi-document summarizers. To this end we built a corpus consisting of (a) 100 Million automatic summaries using six summarizers and baselines at ten summary lengths in both English and Chinese, (b) more than 10,000 manual abstracts and extracts, and (c) 200 Million automatic document and summary retrievals using 20 queries. We present both qualitative and quantitative results showing the strengths and draw-backs of all evaluation methods and how they rank the different summarizers.

[1]  Jean Carletta,et al.  Assessing Agreement on Classification Tasks: The Kappa Statistic , 1996, CL.

[2]  S. Siegel,et al.  Nonparametric Statistics for the Behavioral Sciences , 2022, The SAGE Encyclopedia of Research Design.

[3]  Inderjeet Mani,et al.  Summariz-ing Similarities and Differences Among Related Doc-uments , 2000, AAAI Conference on Artificial Intelligence.

[4]  Klaus Krippendorff,et al.  Content Analysis: An Introduction to Its Methodology , 1980 .

[5]  Karen Spärck Jones,et al.  Generic summaries for indexing in information retrieval , 2001, SIGIR '01.

[6]  Gerald Salton,et al.  Automatic text processing , 1988 .

[7]  Mark Sanderson,et al.  Advantages of query biased summaries in information retrieval , 1998, SIGIR '98.

[8]  Inderjeet Mani,et al.  The Tipster Summac Text Summarization Evaluation , 1999, EACL.

[9]  Wojciech Rytter,et al.  Text Algorithms , 1994 .

[10]  Simone Teufel,et al.  Sentence extraction as a classification task , 1997 .

[11]  Inderjeet Mani,et al.  Summarizing Similarities and Differences Among Related Documents , 1997, Information Retrieval.

[12]  Dragomir R. Radev,et al.  Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies , 2000, ArXiv.

[13]  Eduard Hovy,et al.  Automated Text Summarization in SUMMARIST , 1997, ACL 1997.

[14]  Kenneth Ward Church,et al.  A Program for Aligning Sentences in Bilingual Corpora , 1993, CL.

[15]  Lisa F. Rau,et al.  Automatic Condensation of Electronic Publications by Sentence Selection , 1995, Inf. Process. Manag..

[16]  K. Krippendorff Krippendorff, Klaus, Content Analysis: An Introduction to its Methodology . Beverly Hills, CA: Sage, 1980. , 1980 .

[17]  A. Stuart,et al.  Non-Parametric Statistics for the Behavioral Sciences. , 1957 .

[18]  Inderjeet Mani,et al.  SUMMAC: a text summarization evaluation , 2002, Natural Language Engineering.

[19]  Chin-Yew Lin,et al.  Automated Text Summarization , 2005, IJCNLP.