Computer-produced summaries have traditionally been evaluated by comparing them with human-produced summaries using the F-measure. However, the F-measure is not appropriate when alternative sentences are possible in a human-produced extract. In this paper, we examine some evaluation methods devised to overcome this problem, including utility-based evaluation. By giving scores to moderately important sentences that do not appear in the human-produced extract, utility-based evaluation can resolve the problem. However, the method requires considerable human effort to provide data for evaluation. We first propose a pseudo-utility-based evaluation that uses human-produced extracts at different compression ratios. To evaluate the effectiveness of pseudo-utility-based evaluation, we compare our method with the F-measure using data from the Text Summarization Challenge (TSC) and show that pseudo-utility-based evaluation can resolve the problem. Next, we focus on content-based evaluation. Instead of measuring the ratio of sentences that match exactly between extracts, this method evaluates extracts by comparing their content words with those of human-produced extracts. Although the method has been reported to be effective in resolving the problem, it has not been examined in the context of comparing two extracts produced by different systems. We evaluated computer-produced summaries by content-based evaluation and compared the results with a subjective evaluation. We found that the judgments of the content-based measure matched those of the subjective evaluation in 93% of the cases when the gap in content-based scores between two summaries was more than 0.2.
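The contrast between exact-match and content-based evaluation can be illustrated with a minimal sketch (our illustration, not the paper's exact formulation): the sentence-level F-measure counts sentences that appear verbatim in both the system and human extracts, while a content-based score compares content-word frequency vectors, here via cosine similarity. The function names, the tiny stop-word list, and the regular-expression tokenizer below are simplifying assumptions; a real evaluation would use proper language-specific preprocessing.

import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the paper's resource).
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are"}


def sentence_f_measure(system_sents, human_sents):
    """Sentence-level F-measure: exact-match overlap between two extracts."""
    system, human = set(system_sents), set(human_sents)
    if not system or not human:
        return 0.0
    overlap = len(system & human)
    precision = overlap / len(system)
    recall = overlap / len(human)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def content_word_vector(text):
    """Frequency vector over content words (stop words removed)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def content_based_score(system_text, human_text):
    """Cosine similarity between content-word vectors of two extracts."""
    v1 = content_word_vector(system_text)
    v2 = content_word_vector(human_text)
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(
        sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0

In this sketch, two systems' extracts would each be scored against the same human extract, and, following the paper's finding, a preference between the two systems would be read off the content-based scores only when they differ by more than 0.2.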