Overview of the NTCIR-16 We Want Web with CENTRE (WWW-4) Task

This is an overview of the NTCIR-16 We Want Web with CENTRE (WWW-4) task, the fourth round of an evaluation series that aims to quantify the progress and reproducibility of web search algorithms in offline ad hoc retrieval settings. For WWW-4, we introduced a new English web corpus, which we named Chuweb21. Moreover, in addition to bronze relevance assessments (i.e., those given by assessors who are neither topic creators nor topic experts), we collected gold relevance assessments (i.e., those given by the topic creators themselves). We received 18 runs from 4 teams, including two runs from the organiser team. We describe the task, data, and evaluation measures, and report the official evaluation results.
