Paper information - MIREX: MapReduce Information Retrieval Experiments

MIREX: MapReduce Information Retrieval Experiments

We propose using MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. In a small case study, we use a cluster of 15 low-cost machines to search a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://sourceforge.net/projects/mirex/
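The sequential-scanning idea can be illustrated with a minimal single-process sketch. It is not the MIREX code: the toy documents and queries, the function names, and the raw term-frequency scoring are all assumptions made for illustration, and a dictionary stands in for the shuffle phase that a MapReduce framework would perform across machines. The map step scores each document against every query; the reduce step keeps the top-scoring documents per query.

```python
import heapq
from collections import Counter, defaultdict

# Hypothetical toy data; the experiments in the abstract scan a web crawl.
QUERIES = {"q1": ["mapreduce", "retrieval"], "q2": ["web", "crawl"]}
DOCS = [
    ("d1", "mapreduce makes large scale retrieval experiments simple"),
    ("d2", "a web crawl contains many web pages"),
    ("d3", "retrieval experiments on a web crawl with mapreduce"),
]


def map_document(doc_id, text, queries):
    """Map step: score one document against every query, emit non-zero scores."""
    counts = Counter(text.lower().split())
    for query_id, terms in queries.items():
        # Assumed scoring function: sum of raw query-term frequencies.
        score = sum(counts[term] for term in terms)
        if score > 0:
            yield query_id, (score, doc_id)


def reduce_query(pairs, k):
    """Reduce step: keep the k highest-scoring (score, doc_id) pairs for one query."""
    return heapq.nlargest(k, pairs)


def run(docs, queries, k=2):
    """Scan all documents once, group emitted pairs by query, then reduce."""
    grouped = defaultdict(list)  # stands in for the framework's shuffle phase
    for doc_id, text in docs:
        for query_id, pair in map_document(doc_id, text, queries):
            grouped[query_id].append(pair)
    return {query_id: reduce_query(pairs, k) for query_id, pairs in grouped.items()}


results = run(DOCS, QUERIES)
```

Because every document is scored independently in the map step, trying a different retrieval approach in this sketch only means swapping the scoring line; no index has to be rebuilt.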

Djoerd Hiemstra | Claudia Hauff

[1] Tom White, et al. Hadoop: The Definitive Guide, 2009.

[2] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[3] José Luis Vicedo González, et al. TREC: Experiment and evaluation in information retrieval, 2007, J. Assoc. Inf. Sci. Technol.

[4] Stephen E. Robertson, et al. Microsoft Research at TREC 2009: Web and Relevance Feedback Track, 2009, TREC.

[5] Michael Isard, et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, 2008, OSDI.

[6] Djoerd Hiemstra, et al. Using language models for information retrieval, 2001.

[7] Jimmy J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce, 2009, SIGIR.

[8] Jeffrey Dean, et al. Challenges in building large-scale information retrieval systems: invited talk, 2009, WSDM '09.