Paper information - A Simple Measure to Assess Non-response

A Simple Measure to Assess Non-response

There are several tasks in which not responding is preferable to responding incorrectly. This idea is not new, but despite several previous attempts there is no commonly accepted measure for assessing non-response. Here we study an extension of the accuracy measure that has this feature and an interpretation that is very easy to understand. The proposed measure (c@1) offers a good balance of discrimination power, stability, and sensitivity. We also show how this measure rewards systems that maintain the same number of correct answers while decreasing the number of incorrect ones by leaving some questions unanswered. The measure is well suited to tasks such as Reading Comprehension tests, where multiple choices are given per question but only one is correct.
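The abstract does not spell out the formula. As a minimal sketch, assuming the definition usually quoted for this measure, c@1 = (n_R + n_U * n_R / n) / n, where n_R is the number of correctly answered questions, n_U the number left unanswered, and n the total number of questions, the behaviour described above can be illustrated as follows. The function and parameter names are illustrative, not taken from the paper.

```python
def accuracy(n_correct: int, n_total: int) -> float:
    """Plain accuracy: the fraction of all questions answered correctly."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_correct / n_total


def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """c@1 under the assumed definition (n_R + n_U * n_R / n) / n.

    Each unanswered question is credited with the system's overall
    accuracy n_R / n, so abstaining is never worse than answering wrongly.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct < 0 or n_unanswered < 0:
        raise ValueError("counts must be non-negative")
    if n_correct + n_unanswered > n_total:
        raise ValueError("n_correct + n_unanswered cannot exceed n_total")
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


# Hypothetical systems evaluated on 100 questions.
# System A answers everything: 60 correct, 40 incorrect.
score_a = c_at_1(n_correct=60, n_unanswered=0, n_total=100)
# System B keeps the 60 correct answers but leaves 20 of the
# previously incorrect questions unanswered: 60 correct, 20 incorrect.
score_b = c_at_1(n_correct=60, n_unanswered=20, n_total=100)

print(f"accuracy (both systems): {accuracy(60, 100):.2f}")
print(f"c@1 system A: {score_a:.2f}")
print(f"c@1 system B: {score_b:.2f}")
```

When no question is left unanswered the sketch reduces to plain accuracy, which matches the description of c@1 as an extension of accuracy. In the hypothetical numbers above, both systems have the same accuracy, while the system that abstains instead of answering incorrectly receives the higher c@1.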

Anselmo Peñas | Álvaro Rodrigo
