MapReduce for Information Retrieval Evaluation: "Let's Quickly Test This on 12 TB of Data"

We propose using MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which a cluster of 15 low-cost machines searches a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net
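
To make the idea of retrieval by sequential scanning concrete, the following is a minimal single-process sketch in Python of how a scan might be split into a map step and a reduce step. It is not the code from the project site: the function names (`map_document`, `reduce_query`, `run_scan`), the raw term-count scoring, and the in-memory grouping that stands in for the MapReduce shuffle are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def map_document(doc_id, text, queries):
    """Map step (hypothetical): score one document against every query.

    Emits (query_id, (score, doc_id)) pairs. The score is a raw count of
    query-term occurrences, a placeholder for a real ranking function.
    """
    tokens = text.lower().split()
    for query_id, terms in queries.items():
        score = sum(tokens.count(term) for term in terms)
        if score > 0:
            yield query_id, (score, doc_id)


def reduce_query(scored_docs, k):
    """Reduce step (hypothetical): keep the k best-scoring documents."""
    return heapq.nlargest(k, scored_docs)


def run_scan(docs, queries, k=2):
    """Scan every document once, then rank per query.

    The dictionary below stands in for the shuffle phase that a MapReduce
    framework would perform between the map and reduce steps.
    """
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for query_id, pair in map_document(doc_id, text, queries):
            grouped[query_id].append(pair)
    return {query_id: reduce_query(pairs, k) for query_id, pairs in grouped.items()}


if __name__ == "__main__":
    # Toy data, invented for this example.
    docs = {
        "d1": "hadoop cluster hadoop",
        "d2": "web crawl",
        "d3": "hadoop web",
    }
    queries = {"q1": ["hadoop"], "q2": ["web", "crawl"]}
    print(run_scan(docs, queries))
```

Because every document is read exactly once and scored against all queries in the same pass, a new ranking function can be tried by changing only the scoring line in the map step, with no index to rebuild.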
