The Effects of Human Variation in DUC Summarization Evaluation

There is a long history of research in automatic text summarization systems by both the text retrieval and natural language processing communities, but evaluation of such systems’ output has always presented problems. One critical problem that remains is how to handle the unavoidable variability in human judgments at the core of all the evaluations. Sponsored by the DARPA TIDES project, NIST launched a new text summarization evaluation effort, called DUC, in 2001, with follow-on workshops in 2002 and 2003. Human judgments provided the foundation for all three evaluations, and this paper examines how the variation in those judgments does and does not affect the results and their interpretation.