Reproducibility and Validity in CLEF

In this paper, we investigate CLEF’s contribution to the reproducibility of IR experiments. After discussing the concepts of reproducibility and validity, we show that CLEF has not only produced test collections that other researchers can reuse, but has also undertaken various efforts to enable reproducibility.
