A Semi-automatic Evaluation Scheme

The evaluation of many NLP applications shares a two-step procedure: reference data are processed and aggregated into a gold-standard data set, which is then compared against peer system results. Traditionally, the first step has been performed by human annotators, and the second has been conducted either manually or automatically. As a move toward a fully automated evaluation procedure, we propose a novel semi-automatic evaluation scheme in which both reference and peer data are automatically nuggetized and a carefully designed annotation procedure governs the subsequent comparisons. High inter-annotator agreement on both reference and peer annotations shows that machine-produced nuggets are informative and can be used in evaluation settings. In addition, standardizing the nugget creation process affords us the opportunity to look beyond surface-level phrasal differences toward semantic equivalence.
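To make the two-step structure concrete, the sketch below outlines one possible realization in Python. It is an illustration under assumptions, not the system described here: the sentence-splitting nuggetizer, the `judge` callback standing in for the human annotator, and all names (`Nugget`, `auto_nuggetize`, `evaluate`) are hypothetical.

```python
# Illustrative sketch of a semi-automatic nugget-based evaluation.
# The nuggetizer and the manual match judgment are placeholders (assumptions),
# not the actual components of the proposed scheme.

from dataclasses import dataclass


@dataclass
class Nugget:
    text: str       # a minimal, self-contained unit of information
    source_id: str  # identifier of the reference or peer document it came from


def auto_nuggetize(documents: dict[str, str]) -> list[Nugget]:
    """Step 1 (automatic): split each document into candidate nuggets.

    Sentences are naively treated as nuggets here; a real nuggetizer would
    use richer syntactic or semantic segmentation.
    """
    nuggets = []
    for doc_id, text in documents.items():
        for sentence in text.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                nuggets.append(Nugget(text=sentence, source_id=doc_id))
    return nuggets


def evaluate(reference_docs: dict[str, str],
             peer_docs: dict[str, str],
             judge) -> float:
    """Step 2 (semi-automatic): an annotator judges, for each peer nugget,
    whether it matches a reference nugget; recall over reference nuggets
    is reported. `judge(peer_nugget, ref_nugget) -> bool` stands in for
    the human comparison step.
    """
    ref_nuggets = auto_nuggetize(reference_docs)
    peer_nuggets = auto_nuggetize(peer_docs)

    matched = set()
    for p in peer_nuggets:
        for i, r in enumerate(ref_nuggets):
            if i not in matched and judge(p, r):
                matched.add(i)
                break
    return len(matched) / len(ref_nuggets) if ref_nuggets else 0.0
```

In this sketch the only human-in-the-loop step is the match judgment itself; nugget creation on both the reference and peer sides is fully automatic, which is what makes the overall scheme semi-automatic.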