Using Question Series to Evaluate Question Answering System Effectiveness
The original motivation for using question series in the TREC 2004 question answering track was the desire to model aspects of dialogue processing in an evaluation task that included different question types. The structure introduced by the series also proved to have an important additional benefit: the series is at an appropriate level of granularity for aggregating scores for an effective evaluation. The series is small enough to be meaningful at the task level since it represents a single user interaction, yet it is large enough to avoid the highly skewed score distributions exhibited by single questions. An analysis of the reliability of the per-series evaluation shows the evaluation is stable for differences in scores seen in the track.
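The aggregation idea above can be sketched in a few lines: per-question scores are averaged within each series, so each series (one user interaction) yields a single score, smoothing the highly skewed distribution of individual question scores. This is a minimal illustration only; the simple mean used here is an assumption for clarity, not the official TREC 2004 per-series scoring formula, and the series IDs and scores are invented data.

```python
# Illustrative sketch of per-series score aggregation.
# Assumption: a plain mean over the questions in a series; the actual
# TREC 2004 track combined component scores differently.
from statistics import mean

def series_score(question_scores):
    """Aggregate the per-question scores of one series into a single score."""
    return mean(question_scores)

def per_series_scores(run):
    """run: {series_id: [question_score, ...]} -> {series_id: series_score}"""
    return {sid: series_score(scores) for sid, scores in run.items()}

# Hypothetical run: two series, each modeling one user interaction.
run = {
    "S1": [0.0, 1.0, 0.5],  # three questions in this series
    "S2": [1.0, 1.0],       # two questions in this series
}
agg = per_series_scores(run)
```

Evaluating at this granularity keeps each unit meaningful (a whole interaction) while reducing the variance that single-question scores exhibit.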