Paper Information - Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures that automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units, such as n-grams, word sequences, and word pairs, between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper discusses the validity of the evaluation method used in the Document Understanding Conference (DUC) and evaluates five different ROUGE metrics included in the ROUGE summarization evaluation package, namely ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU, using data provided by DUC. It also presents a comprehensive study of how using single or multiple references and various sample sizes affects the stability of the results.
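The n-gram overlap counting described in the abstract can be illustrated with a minimal sketch of a ROUGE-N style recall score. This is not the ROUGE package's implementation: the function names `ngrams` and `rouge_n_recall` are hypothetical, tokenization is assumed to be lowercased whitespace splitting, and matches are assumed to be pooled over all references with clipped counts.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """ROUGE-N style recall: matched reference n-grams / total reference n-grams.

    Assumptions (not taken from the paper): lowercased whitespace
    tokenization, and counts pooled over all references.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Clip each match by how often the n-gram occurs in the candidate.
        matched += sum(min(count, cand_counts[gram])
                       for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    references = ["the cat was on the mat"]
    print(rouge_n_recall(candidate, references, n=1))  # 5 of 6 unigrams match
    print(rouge_n_recall(candidate, references, n=2))  # 3 of 5 bigrams match
```

Because the denominator counts reference n-grams only, the score is recall-oriented: a candidate is rewarded for covering the human summaries, not for being concise.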

Chin-Yew Lin
