Evaluating Information Content by Factoid Analysis: Human annotation and stability

We present a new approach to intrinsic summary evaluation, based on initial experiments in van Halteren and Teufel (2003), which combines two novel aspects: comparison of information content (rather than string similarity) between gold standard and system summary, measured in shared atomic information units which we call factoids, and comparison against more than one gold standard summary (in our data, 20 and 50 summaries for the two test texts, respectively). In this paper, we show that factoid annotation is highly reproducible, introduce a weighted factoid score, estimate how many summaries are required for stable system rankings, and show that factoid scores cannot be sufficiently approximated by unigram overlap or by the DUC information overlap measure.
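To make the factoid-based comparison concrete, the following minimal sketch (our illustration, not the paper's implementation) scores a system summary against factoid annotations of multiple gold-standard summaries. It assumes factoids are represented as identifiers in sets, weights each factoid by the fraction of human summaries containing it, and normalizes by the total weight mass; the function names and the exact normalization are assumptions, not the paper's definitions.

```python
from collections import Counter

def factoid_weights(gold_factoid_sets):
    """Weight each factoid by the fraction of gold-standard summaries
    containing it. gold_factoid_sets: list of sets of factoid identifiers,
    one set per human summary (hypothetical representation; the paper's
    factoids are manually annotated atomic information units)."""
    counts = Counter()
    for factoids in gold_factoid_sets:
        counts.update(factoids)
    n = len(gold_factoid_sets)
    return {f: c / n for f, c in counts.items()}

def weighted_factoid_score(system_factoids, weights):
    """Score a system summary as the weight mass of the factoids it
    covers, normalized by the total weight mass of all gold factoids
    (an assumed normalization, for illustration only)."""
    covered = sum(w for f, w in weights.items() if f in system_factoids)
    total = sum(weights.values())
    return covered / total if total else 0.0

# Toy usage with hypothetical factoid labels:
gold = [{"F1", "F2", "F3"}, {"F1", "F3"}, {"F1", "F4"}]
weights = factoid_weights(gold)
print(weighted_factoid_score({"F1", "F3"}, weights))  # 5/7 ~= 0.714
```

Under this weighting, a system summary earns more credit for covering factoids that many human summarizers agreed were worth including, which is the intuition behind scoring against multiple gold standards rather than a single one.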