How to Measure the Reproducibility of System-oriented IR Experiments

Replicability and reproducibility of experimental results are primary concerns in all areas of science, and IR is no exception. Besides the problem of moving the field towards more reproducible experimental practices and protocols, we also face a severe methodological issue: we lack the means to assess when a result has actually been replicated or reproduced. Moreover, we lack a reproducibility-oriented dataset that would allow us to develop such methods. To address these issues, we compare several measures that objectively quantify the extent to which a system-oriented IR experiment has been replicated or reproduced. These measures operate at different levels of granularity, from the fine-grained comparison of ranked lists to the more general comparison of the obtained effects and significant differences. In addition, we develop a reproducibility-oriented dataset, which allows us to validate our measures and which can also be used to develop future ones.
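To make the two levels of granularity concrete, the following Python sketch shows one way such comparisons could be carried out. The function names, the use of Kendall's tau for ranked-list agreement, and the paired t-test for per-topic effects are illustrative assumptions for this sketch, not necessarily the exact measures proposed in the paper.

```python
# Illustrative sketch only: two granularity levels at which a reproduced run
# can be compared against the original run. Kendall's tau and the paired
# t-test are assumed here for illustration; they are not claimed to be the
# paper's exact measures.
from scipy.stats import kendalltau, ttest_rel

def ranked_list_agreement(original_docs, reproduced_docs):
    """Fine-grained level: rank correlation between the document rankings
    produced for a single topic by the original and the reproduced system."""
    # Restrict to documents retrieved by both runs so ranks are comparable.
    shared = [d for d in original_docs if d in set(reproduced_docs)]
    orig_ranks = [original_docs.index(d) for d in shared]
    repr_ranks = [reproduced_docs.index(d) for d in shared]
    tau, _ = kendalltau(orig_ranks, repr_ranks)
    return tau  # 1.0 means the shared documents are ordered identically

def effect_agreement(original_scores, reproduced_scores):
    """Coarser level: do per-topic effectiveness scores (e.g., AP) of the
    reproduced run differ significantly from those of the original run?"""
    t_stat, p_value = ttest_rel(original_scores, reproduced_scores)
    return t_stat, p_value  # a large p-value is consistent with reproduction

# Toy usage with made-up data
tau = ranked_list_agreement(["d1", "d2", "d3", "d4"], ["d1", "d3", "d2", "d4"])
t, p = effect_agreement([0.31, 0.45, 0.22, 0.58], [0.30, 0.47, 0.20, 0.60])
print(f"tau={tau:.2f}, t={t:.2f}, p={p:.3f}")
```

In practice, the fine-grained check would be averaged over all topics of the test collection, while the coarser check operates directly on the per-topic effectiveness scores of the original and reproduced runs.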
