On the feasibility of multi-site web search engines

Web search engines are often implemented as centralized systems. Designing and implementing a Web search engine in a distributed environment is a challenging engineering task that encompasses many interesting research questions. However, distributing a search engine across multiple sites has several advantages, such as using fewer compute resources and exploiting data locality. In this paper we investigate the cost-effectiveness of building a distributed Web search engine. We propose a model for assessing the total cost of a distributed Web search engine that includes both the computational costs and the cost of communication among all distributed sites. We then present a query-processing algorithm that maximizes the number of queries answered locally without sacrificing result quality relative to a centralized search engine. We simulate the algorithm on real document collections and query workloads to measure the parameters needed for our cost model, and we show that a distributed search engine can be competitive with a centralized architecture with respect to real cost.
