Diversified caching for replicated web search engines

Commercial web search engines adopt parallel and replicated architectures to support high query throughput. In this paper, we investigate the effect of caching on throughput in such a setting. A simple scheme, called uniform caching, replicates the same cache content on all servers. Unfortunately, it does not exploit the variation among queries and thus wastes memory by caching the same content redundantly on multiple servers. To address this limitation, we propose the diversified caching problem, which aims to diversify the types of queries served by different servers and to maximize the sharing of terms among queries assigned to the same server. We show that finding the optimal diversified caching scheme is NP-hard, and we identify intuitive properties that guide the search for good solutions. We then present a framework with a suite of techniques and heuristics for diversified caching. Finally, we evaluate the proposed solution against competitors using a real dataset and a real query log.
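
To make the contrast between uniform and diversified caching concrete, the following toy sketch may help. It is not the framework proposed in the paper, whose details the abstract does not give. It assumes a simplified cache model in which each server caches a fixed number of terms (`capacity`) and a query counts as a hit only if all of its terms are cached. The function names (`top_terms`, `greedy_assign`, and so on) and the greedy, load-capped assignment rule are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import List, Set

Query = Set[str]


def top_terms(queries: List[Query], capacity: int) -> Set[str]:
    """Pick the `capacity` most frequent terms (ties broken alphabetically)."""
    counts = Counter(t for q in queries for t in q)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return set(ranked[:capacity])


def hits(queries: List[Query], cache: Set[str]) -> int:
    """Count queries whose terms are all in the cache (assumed hit model)."""
    return sum(1 for q in queries if q <= cache)


def uniform_hits(queries: List[Query], capacity: int) -> int:
    """Uniform caching: every server holds the same globally frequent terms."""
    return hits(queries, top_terms(queries, capacity))


def greedy_assign(queries: List[Query], num_servers: int) -> List[List[Query]]:
    """Hypothetical heuristic: send each query to the server that already
    shares the most terms with it, subject to an equal-load cap."""
    limit = -(-len(queries) // num_servers)  # ceiling division
    groups: List[List[Query]] = [[] for _ in range(num_servers)]
    terms: List[Set[str]] = [set() for _ in range(num_servers)]
    for q in queries:
        open_servers = [s for s in range(num_servers) if len(groups[s]) < limit]
        # Prefer more shared terms, then lighter load, then lower server id.
        best = max(open_servers,
                   key=lambda s: (len(q & terms[s]), -len(groups[s]), -s))
        groups[best].append(q)
        terms[best] |= q
    return groups


def diversified_hits(queries: List[Query], num_servers: int, capacity: int) -> int:
    """Diversified caching: each server caches terms of its own query group."""
    return sum(hits(g, top_terms(g, capacity))
               for g in greedy_assign(queries, num_servers))


if __name__ == "__main__":
    log = [
        {"cheap", "flights"},
        {"python", "tutorial"},
        {"cheap", "hotels"},
        {"python", "download"},
    ]
    print("uniform:", uniform_hits(log, capacity=3))
    print("diversified:", diversified_hits(log, num_servers=2, capacity=3))
```

On this four-query log with two servers and room for three terms per server, the uniform cache fully covers one query, while the diversified assignment covers all four, because each server spends its capacity on terms shared within its own group. The result depends heavily on the toy data and the assumed hit model, so it illustrates the intuition only and says nothing about the paper's measured gains.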
