Query forwarding in geographically distributed search engines

Query forwarding is an important technique for preserving result quality in distributed search engines whose index is geographically partitioned over multiple search sites. The key component in query forwarding is the thresholding algorithm that makes the forwarding decisions. In this paper, we propose a linear-programming-based thresholding algorithm that significantly outperforms the current state of the art in terms of achieved search efficiency. Moreover, we evaluate a greedy heuristic for partial index replication and investigate the impact of result cache freshness on query forwarding performance. Finally, we present some optimizations that further improve performance under certain conditions. We evaluate the proposed techniques through simulations in a real-life setting, using a large query log and a document collection obtained from Yahoo!.
