Experimental methods for information retrieval

Experimental evaluation plays a critical role in driving progress in information retrieval (IR) today. There are few alternative ways of exploring, in depth, the empirical merits (or lack thereof) of newly devised search techniques. Careful evaluation is necessary for advancing the state of the art; yet many published papers present work that was ill-evaluated. Indeed, this phenomenon has recently garnered attention from the community, following the publication of a controversial but eye-opening study by Armstrong et al. suggesting that ad hoc search quality has not meaningfully advanced since 1998 [16]. The authors noted that the root of the problem was generally lax evaluation methodology (e.g., comparisons against weak baselines). Furthermore, many submissions to top IR research venues (e.g., SIGIR, CIKM, ECIR, WSDM, WWW) are rejected primarily because of insufficient or inappropriate evaluation.

There is therefore a strong need to educate students, researchers, and practitioners about the proper way to carry out IR experiments. This is unfortunately not something that is taught in IR courses or covered in IR textbooks. Indeed, to the best of our knowledge, there is very little written work that lays down principles for running an IR experiment [3]. More specifically, no recent tutorial or written work has specifically and comprehensively addressed the question of “how to run an IR experiment” in terms of effectiveness evaluation. This has potentially yielded a number of detrimental effects, as described above.

The goal of the tutorial is to provide an initial set of training material for researchers interested in rigorous evaluation of information retrieval systems. Although the primary focus is on ad hoc retrieval experiments, the principles and concepts described in the tutorial are general and can easily be applied to a wide range of experimental scenarios both within, and beyond, the field of information retrieval. The tutorial is primarily intended for graduate students, researchers, and practitioners.
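As one concrete illustration of the kind of methodology such a tutorial covers, the sketch below (not taken from the tutorial materials; the topic IDs and score values are invented for illustration) compares a proposed system against a baseline using mean average precision and a paired two-sided randomization test over per-topic AP differences, in the spirit of the significance-testing work cited below (e.g., [12], [26], [32]). In practice the per-topic scores would come from a tool such as trec_eval run over real retrieval runs and relevance judgments.

```python
# Minimal sketch of a rigorous system comparison: mean average precision (MAP)
# plus a paired two-sided randomization (permutation) test on per-topic AP.
# NOTE: all topic IDs and scores below are made up purely for illustration.

import random

# Hypothetical per-topic AP scores for a baseline and a proposed system.
baseline = {"301": 0.21, "302": 0.35, "303": 0.10, "304": 0.47, "305": 0.28}
proposed = {"301": 0.25, "302": 0.33, "303": 0.18, "304": 0.51, "305": 0.30}

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def randomization_test(a, b, trials=100_000, seed=0):
    """Paired two-sided randomization test on per-topic score differences.

    Under the null hypothesis the two systems are exchangeable on each topic,
    so we randomly flip the sign of each per-topic difference and count how
    often the absolute mean difference is at least as large as observed.
    """
    topics = sorted(a)
    diffs = [b[t] - a[t] for t in topics]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed:
            extreme += 1
    return extreme / trials

print(f"MAP baseline: {mean(baseline.values()):.4f}")
print(f"MAP proposed: {mean(proposed.values()):.4f}")
print(f"two-sided randomization p-value: {randomization_test(baseline, proposed):.4f}")
```

With only five (fabricated) topics the test is, of course, underpowered; the point of the sketch is the procedure itself: report a per-topic effectiveness measure, a meaningful baseline, and a paired significance test rather than a single aggregate number.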

[1] John C. Henderson, et al. Direct Maximization of Average Precision by Hill-Climbing, with a Comparison to a Maximum Entropy Approach, 2004, HLT-NAACL.

[2] Ellen M. Voorhees, et al. The cluster hypothesis revisited, 1985, SIGIR '85.

[3] Susan T. Dumais, et al. Inductive learning algorithms and representations for text categorization, 1998, CIKM '98.

[4] John D. Lafferty, et al. Model-based feedback in the language modeling approach to information retrieval, 2001, CIKM '01.

[5] Mark Sanderson, et al. Information retrieval system evaluation: effort, sensitivity, and reliability, 2005, SIGIR '05.

[6] Ellen M. Voorhees, et al. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing), 2005.

[7] James Allan, et al. A New Measure of the Cluster Hypothesis, 2009, ICTIR.

[8] J. J. Rocchio, et al. Relevance feedback in information retrieval, 1971.

[9] Jimmy J. Lin, et al. Quantitative evaluation of passage retrieval algorithms for question answering, 2003, SIGIR.

[10] W. Bruce Croft, et al. A Markov random field model for term dependencies, 2005, SIGIR '05.

[11] Tetsuya Sakai. Comparing metrics across TREC and NTCIR: the robustness to system bias, 2008, CIKM '08.

[12] James Allan, et al. A comparison of statistical significance tests for information retrieval evaluation, 2007, CIKM '07.

[13] Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.

[14] Tie-Yan Liu, et al. Learning to rank for information retrieval, 2009, SIGIR.

[15] Kevyn Collins-Thompson. Estimating Robust Query Models with Convex Optimization, 2008, NIPS.

[16] Alistair Moffat, et al. Improvements that don't add up: ad-hoc retrieval results since 1998, 2009, CIKM.

[17] Ellen M. Voorhees, et al. Overview of the TREC 2004 Robust Track, 2004.

[18] C. J. van Rijsbergen, et al. Investigating the relationship between language model perplexity and IR precision-recall measures, 2003, SIGIR.

[19] Alistair Moffat, et al. What Does It Mean to "Measure Performance"?, 2004, WISE.

[20] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[21] Ellen M. Voorhees, et al. Evaluating Evaluation Measure Stability, 2000, SIGIR '00.

[22] Donna K. Harman, et al. The NRRC reliable information access (RIA) workshop, 2004, SIGIR '04.

[23] Charles L. A. Clarke, et al. The effect of document retrieval quality on factoid question answering performance, 2004, SIGIR '04.

[24] Charles L. A. Clarke, et al. Efficient and effective spam filtering and re-ranking for large web datasets, 2010, Information Retrieval.

[25] Tetsuya Sakai, et al. Flexible pseudo-relevance feedback via selective sampling, 2005, TALIP.

[26] James Allan, et al. Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes, 2009, SIGIR.

[27] Olivier Chapelle, et al. Expected reciprocal rank for graded relevance, 2009, CIKM.

[28] Yiming Yang, et al. A re-examination of text categorization methods, 1999, SIGIR '99.

[29] ChengXiang Zhai, et al. Adaptive relevance feedback in information retrieval, 2009, CIKM.

[30] W. Bruce Croft, et al. Latent concept expansion using markov random fields, 2007, SIGIR.

[31] Cyril W. Cleverdon, et al. Factors determining the performance of indexing systems, 1966.

[32] Alistair Moffat, et al. Statistical power in retrieval experimentation, 2008, CIKM '08.

[33] Mark Sanderson, et al. Test Collection Based Evaluation of Information Retrieval Systems, 2010, Found. Trends Inf. Retr.

[34] C. J. van Rijsbergen, et al. The use of hierarchic clustering in information retrieval, 1971, Inf. Storage Retr.

[35] W. Bruce Croft, et al. A general language model for information retrieval, 1999, CIKM '99.