Large-scale information retrieval experimentation with Terrier

This tutorial aims to provide a practical introduction to conducting large-scale information retrieval (IR) experiments, using Terrier (http://terrier.org) as an experimentation platform. Written in Java, Terrier provides an open-source, feature-rich, flexible, and robust environment for large-scale IR experimentation. The tutorial covers the experimentation process end-to-end: from configuring Terrier for a particular experimental setting, through efficiently indexing a document corpus and retrieving from it, to evaluating the outcome. Moreover, it describes how to use and extend the platform to one's own needs, illustrated by practical research-driven examples. As a half-day tutorial, it is split into two major sessions, each comprising both background information and practical demonstrations. In the first session, we provide an overview of several aspects of large-scale IR experimentation, spanning areas such as indexing, data structures, query languages, and advanced retrieval models, and how these are implemented within Terrier. In the second session, we discuss how to extend Terrier to conduct one's own experiments in a large-scale setting, including how to facilitate the evaluation of non-standard IR tasks through crowdsourcing. The practical demonstrations cover recent use cases identified from Terrier's online discussion forum, so as to provide attendees with concrete examples of what can be done within Terrier.
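The end-to-end process described above (configure, index, retrieve, evaluate) can be sketched with Terrier's batch scripts. This is a minimal sketch only: the script names follow the classic Terrier 3.x distribution, the corpus path is a placeholder, and exact invocations and flags may differ across Terrier versions.

```shell
# Sketch of a classic Terrier batch-experimentation workflow.
# Script names follow older Terrier 3.x distributions and may differ
# in later versions; /path/to/corpus is a placeholder.

# 1. Configure: point Terrier at the corpus, generating etc/collection.spec
bin/trec_setup.sh /path/to/corpus

# 2. Index the corpus (properties in etc/terrier.properties control
#    tokenisation, stemming, stopword removal, index structures, ...)
bin/trec_terrier.sh -i

# 3. Retrieve: run a batch of TREC topics against the index,
#    writing a TREC-format run file under var/results/
bin/trec_terrier.sh -r

# 4. Evaluate the resulting run file against relevance judgments (qrels)
bin/trec_terrier.sh -e
```

The retrieval model used in step 3 (e.g. a divergence-from-randomness weighting model) is selected via a property in `etc/terrier.properties`, so the same indexed corpus can be reused across many retrieval experiments.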
