Exploring the Impact of Inter-query Variability on the Performance of Retrieval Systems

This paper introduces a framework for evaluating the performance of information retrieval systems. Current evaluation metrics report a single average score and ignore how performance varies across the query set. As a result, conclusions carry no measure of statistical significance, generalize poorly to queries outside the query set, and may lead to unfair comparisons between systems. We propose applying statistical methods to obtain a more informative measure for problems in which distinct query classes can be identified. We assess performance variability at two levels: overall variability across the whole query set, and variability specific to each query class. To this end, we estimate confidence bands for precision-recall curves and apply ANOVA to test whether performance differs significantly across query classes.
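
The two levels of analysis can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes each query yields an average-precision score and a precision curve sampled at fixed recall levels. All scores, class names, and function names below are hypothetical. The confidence band is a pointwise percentile bootstrap over queries, and the ANOVA p-value is approximated with a permutation test instead of the F distribution so that the example needs only the Python standard library.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-class mean square over within-class mean square."""
    pooled = [s for g in groups for s in g]
    grand = mean(pooled)
    k, n = len(groups), len(pooled)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_p_value(groups, n_perm=2000, seed=0):
    """Approximate the ANOVA p-value by shuffling class labels (an assumption
    made here to avoid external libraries; not taken from the paper)."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [s for g in groups for s in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def bootstrap_band(curves, n_boot=1000, alpha=0.05, seed=0):
    """Pointwise percentile band for the mean precision at each recall level,
    obtained by resampling queries with replacement."""
    rng = random.Random(seed)
    n, m = len(curves), len(curves[0])
    boot_means = [[] for _ in range(m)]
    for _ in range(n_boot):
        sample = [curves[rng.randrange(n)] for _ in range(n)]
        for j in range(m):
            boot_means[j].append(mean(c[j] for c in sample))
    lo_idx = round((alpha / 2) * n_boot)
    hi_idx = round((1 - alpha / 2) * n_boot) - 1
    band = []
    for column in boot_means:
        column.sort()
        band.append((column[lo_idx], column[hi_idx]))
    return band


# Hypothetical per-query average precision, grouped by query class.
ap_by_class = {
    "buildings": [0.82, 0.79, 0.88, 0.85],
    "paintings": [0.41, 0.52, 0.47, 0.38],
    "logos": [0.63, 0.58, 0.69, 0.61],
}
groups = list(ap_by_class.values())
f_value = f_statistic(groups)
p_value = permutation_p_value(groups)

# Hypothetical precision values at recall 0.0, 0.25, 0.5, 0.75, 1.0 (one row per query).
pr_curves = [
    [1.00, 0.90, 0.80, 0.60, 0.40],
    [1.00, 0.70, 0.55, 0.45, 0.30],
    [0.90, 0.85, 0.75, 0.50, 0.35],
    [1.00, 0.60, 0.50, 0.40, 0.20],
    [0.95, 0.80, 0.65, 0.55, 0.25],
]
band = bootstrap_band(pr_curves)

print(f"F = {f_value:.2f}, permutation p = {p_value:.4f}")
for level, (low, high) in zip([0.0, 0.25, 0.5, 0.75, 1.0], band):
    print(f"recall {level:.2f}: mean precision in [{low:.3f}, {high:.3f}]")
```

A wide band at a given recall level signals high inter-query variability there, and a small p-value suggests that mean performance differs between at least two query classes. With SciPy available, `scipy.stats.f_oneway` would give the parametric p-value directly.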
