The Effects of Human Variation in DUC Summarization Evaluation

There is a long history of research in automatic text summarization systems by both the text retrieval and natural language processing communities, but evaluation of such systems’ output has always presented problems. One critical problem that remains is how to handle the unavoidable variability in human judgments at the core of all the evaluations. Sponsored by the DARPA TIDES project, NIST launched a new text summarization evaluation effort, called DUC, in 2001, with follow-on workshops in 2002 and 2003. Human judgments provided the foundation for all three evaluations, and this paper examines how the variation in those judgments does and does not affect the results and their interpretation.