Evaluating question answering validation as a classification problem

Formulating Question Answering Validation as a classification problem facilitates the introduction of Machine Learning techniques to improve the overall performance of Question Answering systems. Because positive and negative examples occur in unequal proportions in the evaluation collections, measures based on precision and recall have been used. However, evaluation based on analysis of the Receiver Operating Characteristic (ROC) space is sometimes preferred for classification over unbalanced collections. In this article we compare the two evaluation approaches in terms of their rationale, their stability, their discrimination power, and their adequacy to the particularities of the Answer Validation task.
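As a rough illustration of why the two families of measures can disagree on unbalanced collections, the sketch below computes precision, recall, and the ROC point (false positive rate, true positive rate) for a hypothetical answer validator. The collection sizes, the acceptance counts, and the helper names (`make_collection`, `confusion_counts`, `precision_recall`, `roc_point`) are assumptions made for this example and are not taken from the article.

```python
def make_collection(n_correct, n_incorrect, accepted_correct, accepted_incorrect):
    """Build gold labels and validator decisions for a hypothetical collection."""
    gold = [True] * n_correct + [False] * n_incorrect
    predicted = (
        [True] * accepted_correct
        + [False] * (n_correct - accepted_correct)
        + [True] * accepted_incorrect
        + [False] * (n_incorrect - accepted_incorrect)
    )
    return gold, predicted


def confusion_counts(gold, predicted):
    """Return (tp, fp, fn, tn) for boolean gold labels and decisions."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision and recall over the positive (correct answer) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_point(tp, fp, fn, tn):
    """Coordinates of the validator in ROC space: (FPR, TPR)."""
    fpr = fp / (fp + tn) if fp + tn else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    return fpr, tpr


# Same per-class behaviour (75% of correct and 25% of incorrect answers
# accepted), but the second collection has ten times more incorrect answers.
for n_incorrect, accepted_incorrect in [(80, 20), (800, 200)]:
    gold, predicted = make_collection(20, n_incorrect, 15, accepted_incorrect)
    tp, fp, fn, tn = confusion_counts(gold, predicted)
    print(n_incorrect, precision_recall(tp, fp, fn), roc_point(tp, fp, fn, tn))
```

Under these assumed counts the ROC point is identical for both collections, while precision drops as the share of incorrect answers grows; recall is unaffected because it coincides with the true positive rate.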
