Using the Euclidean Distance for Retrieval Evaluation

In information retrieval systems and digital libraries, the evaluation of retrieval results is a central concern. To date, almost all commonly used metrics, such as average precision and recall-level precision, are ranking-based. In this work, we investigate whether a score-based method, the Euclidean distance, is a good option for retrieval evaluation. Two variations are discussed: one uses a linear model to estimate the relation between rank and relevance in result lists, while the other uses a more sophisticated cubic regression model. Our experiments with two groups of runs submitted to TREC show that the new metrics correlate strongly with ranking-based metrics when averaged over all 50 queries. They also show that one of the variations (the linear model) has better overall quality than all of the ranking-based metrics involved. Another surprising finding is that a commonly used metric, average precision, may not be as good as previously thought.
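
The abstract does not spell out the exact formulation, but the general idea can be sketched as follows: estimate an "ideal" relevance profile from rank via a regression model, then score a run by the Euclidean distance between its relevance judgments and that profile (smaller is better). The sketch below is a minimal illustration of this reading; the function names, regression coefficients, and clipping are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def estimated_relevance(ranks, model="linear", coeffs=None):
    """Estimate relevance from rank position.

    Assumption: a decreasing linear model and a cubic polynomial model
    of rank vs. relevance. The paper's fitted coefficients are not
    reproduced here, so the defaults are placeholders.
    """
    r = np.asarray(ranks, dtype=float)
    if model == "linear":
        a, b = coeffs if coeffs is not None else (1.0, -0.01)
        est = a + b * r
    elif model == "cubic":
        c0, c1, c2, c3 = coeffs if coeffs is not None else (1.0, -0.03, 3e-4, -1e-6)
        est = c0 + c1 * r + c2 * r**2 + c3 * r**3
    else:
        raise ValueError("unknown model")
    # Keep estimated relevance in [0, 1].
    return np.clip(est, 0.0, 1.0)

def euclidean_distance_score(run_relevance, ideal_relevance):
    """Euclidean distance between a run's relevance vector and an
    ideal relevance profile; a smaller distance means the run is
    closer to the ideal."""
    run = np.asarray(run_relevance, dtype=float)
    ideal = np.asarray(ideal_relevance, dtype=float)
    return float(np.linalg.norm(run - ideal))

# Example: a 10-document ranked list with binary judgments, compared
# against the linear-model estimate of the ideal relevance profile.
judgments = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=float)
ideal = estimated_relevance(np.arange(1, 11), model="linear")
print(euclidean_distance_score(judgments, ideal))
```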
