Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search

This paper describes Ivory, an attempt to build a distributed retrieval system around the open-source Hadoop implementation of MapReduce. We focus on three noteworthy aspects of our work: a retrieval architecture built directly on the Hadoop Distributed File System (HDFS), a scalable MapReduce algorithm for inverted indexing, and webpage classification to enhance retrieval effectiveness.
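To make the inverted indexing aspect concrete, the following is a minimal single-process sketch of the general MapReduce pattern for building an inverted index. It is not Ivory's actual algorithm, which the abstract does not detail. The function names (`map_document`, `reduce_term`, `build_index`), the whitespace tokenizer, and the in-memory shuffle are all assumptions made for illustration; a real Hadoop job would distribute the map and reduce phases across a cluster and write postings to HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Posting = Tuple[int, int]  # (document id, term frequency)


def map_document(docid: int, text: str) -> Iterator[Tuple[str, Posting]]:
    """Map phase (hypothetical): emit one (term, posting) pair per distinct term."""
    counts: Dict[str, int] = defaultdict(int)
    for token in text.lower().split():  # naive whitespace tokenizer (assumption)
        counts[token] += 1
    for term, tf in counts.items():
        yield term, (docid, tf)


def reduce_term(term: str, postings: Iterable[Posting]) -> Tuple[str, List[Posting]]:
    """Reduce phase (hypothetical): collect a term's postings, sorted by document id."""
    return term, sorted(postings)


def build_index(docs: Dict[int, str]) -> Dict[str, List[Posting]]:
    """Simulate the shuffle in memory: group mapper output by term, then reduce."""
    grouped: Dict[str, List[Posting]] = defaultdict(list)
    for docid, text in docs.items():
        for term, posting in map_document(docid, text):
            grouped[term].append(posting)
    return dict(reduce_term(term, postings) for term, postings in grouped.items())


if __name__ == "__main__":
    index = build_index({1: "ivory tusk ivory", 2: "tusk"})
    print(index["ivory"])
    print(index["tusk"])
```

Sorting postings inside the reducer, as done here, requires buffering every posting for a term in memory, which is one reason a simple formulation like this does not scale to very frequent terms on web-sized collections.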
