Relevance Judgments for Image Retrieval Evaluation

In this chapter, we review our experiences with the relevance judging process at ImageCLEF, using the medical retrieval task as the primary example. We begin with a historical perspective on the Cranfield paradigm, the precursor on which most modern system-based evaluation campaigns, including ImageCLEF, are modeled. We then briefly describe the stages of an evaluation campaign and detail the different aspects of the relevance judgment process. We summarize the recruitment of judges and describe the various judging systems used at ImageCLEF. We discuss the advantages and limitations of creating pools that are then judged by human experts, as sketched below. Finally, we discuss our experiences with the subjectivity of relevance judging and the relative robustness of the performance measures to variability in the judgments.
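For readers unfamiliar with pooling, the following is a minimal illustrative sketch (not the chapter's own tooling) of the standard pooling procedure: for each topic, the top-ranked documents from every submitted run are merged into a single set, and only that set is shown to the human judges. The run names and pool depth below are hypothetical.

```python
from typing import Dict, List, Set

def build_pool(runs: Dict[str, List[str]], depth: int = 50) -> Set[str]:
    """Return the union of the top-`depth` documents from each run for one topic."""
    pool: Set[str] = set()
    for run_id, ranked_docs in runs.items():
        pool.update(ranked_docs[:depth])  # only the highest-ranked documents are judged
    return pool

# Hypothetical example: three runs submitted for a single topic
runs = {
    "runA": ["img12", "img07", "img03", "img99"],
    "runB": ["img07", "img45", "img12", "img08"],
    "runC": ["img03", "img21", "img07", "img05"],
}
print(sorted(build_pool(runs, depth=3)))  # the documents sent to the judges
```

Documents outside the pool are typically treated as non-relevant, which is the main source of the pooling bias discussed in the chapter.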
