Summarization Evaluation Methods: Experiments and Analysis

Two methods are commonly used to evaluate summarization systems: comparing generated summaries against an "ideal" summary, and measuring how well summaries help a person perform a task such as information retrieval. We carried out two large experiments to study these two evaluation methods. Our results show that different parameters of an experiment can dramatically affect how well a system scores. For example, summary length was found to affect both types of evaluation: for the "ideal" summary based evaluation, accuracy decreases as summary length increases, while for task-based evaluation, summary length and accuracy on an information retrieval task appear to correlate randomly. In this paper, we show how this parameter and others can affect evaluation results and describe how parameters can be controlled to produce a sound evaluation.
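To make the first evaluation method concrete, below is a minimal sketch of scoring a system summary against an "ideal" summary, assuming both are sentence extracts represented as sets of sentence positions. The function name and the precision/recall formulation are illustrative assumptions, not the paper's exact scoring procedure.

```python
# A minimal sketch of "ideal" summary based evaluation for sentence extracts.
# Assumption: both summaries are sets of sentence IDs (positions in the source
# document); the paper's actual metric may differ.

def extract_accuracy(system_sentences, ideal_sentences):
    """Score a system extract against an "ideal" extract by sentence overlap.

    Returns (precision, recall) over the selected sentence IDs.
    """
    overlap = len(system_sentences & ideal_sentences)
    precision = overlap / len(system_sentences) if system_sentences else 0.0
    recall = overlap / len(ideal_sentences) if ideal_sentences else 0.0
    return precision, recall

# Example: a 3-sentence system extract vs. a 4-sentence ideal extract.
p, r = extract_accuracy({0, 4, 7}, {0, 2, 4, 9})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

Note how this framing exposes the length effect described above: as the system extract grows, recall against a fixed-length ideal summary can only rise, while precision tends to fall, so the summary length must be controlled for scores to be comparable across systems.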
