Quantifying performance and quality gains in distributed web search engines

Distributed search engines based on geographical partitioning of a central Web index have emerged as a feasible solution to the immense growth of the Web, user bases, and query traffic. However, there is still a lack of research quantifying the performance and quality gains that such architectures can achieve. In this paper, we develop various cost models to evaluate the performance benefits of a geographically distributed search engine architecture based on partial index replication and query forwarding. Specifically, we focus on the performance gains made possible by distributing the query processing and Web crawling processes. We show that any response time gain achieved by distributed query processing can be used to improve search relevance, since it makes it feasible to use more complex but more accurate algorithms for document ranking. We also show that distributed Web crawling leads to better Web coverage, and we investigate whether this improves search quality. We verify the validity of our claims via simulations over large, real-life datasets.
