Online result cache invalidation for real-time web search

Result caches are critical components of modern Web search engines, since they enable lower response times for frequent queries and reduce the load on the search engine backend. Results in long-lived cache entries may become stale, however, as search engines continuously update their index to incorporate changes to the Web. Consequently, it is important to provide mechanisms that control the degree of staleness of cached results, ideally enabling the search engine to always return fresh results. In this paper, we present a new mechanism that identifies and invalidates, online, query results that have become stale in the cache. The basic idea is to check at query time, against recent changes to the index, whether the results of a cache hit have changed. To make invalidation more efficient, the generation time of cached queries and their chronological order with respect to the latest index update are used to prune unaffected queries early. We evaluate the proposed approach using documents that change over time and query logs of the Yahoo! search engine. We show that the proposed approach ensures good query results (50% fewer stale results) and high invalidation accuracy (90% fewer unnecessary invalidations) compared to a baseline approach that makes invalidation decisions offline. More importantly, the proposed approach induces less processing overhead, ensuring an average throughput 73% higher than that of the baseline approach.
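The query-time check and the timestamp-based pruning described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `CacheEntry`, `IndexUpdate`, and `ResultCache` are hypothetical, time is a simple integer counter, and the staleness test (any overlap between the query terms and the terms of an updated document) is a simplified stand-in for whatever test the actual mechanism applies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class CacheEntry:
    results: List[str]   # cached result document ids
    generated_at: int    # logical time at which the results were computed


@dataclass
class IndexUpdate:
    applied_at: int      # logical time at which the update reached the index
    terms: Set[str]      # terms of the documents touched by the update


@dataclass
class ResultCache:
    entries: Dict[str, CacheEntry] = field(default_factory=dict)
    updates: List[IndexUpdate] = field(default_factory=list)  # oldest first

    def put(self, query: str, results: List[str], now: int) -> None:
        self.entries[query] = CacheEntry(results, now)

    def record_update(self, terms: Set[str], now: int) -> None:
        self.updates.append(IndexUpdate(now, terms))

    def lookup(self, query: str) -> Optional[List[str]]:
        """Return cached results, or None on a miss or an invalidated hit."""
        entry = self.entries.get(query)
        if entry is None:
            return None
        # Early pruning: an entry generated after the latest index update
        # cannot be affected by any recorded update, so skip the check.
        if not self.updates or entry.generated_at >= self.updates[-1].applied_at:
            return entry.results
        query_terms = set(query.split())
        # Walk updates from newest to oldest; stop at the first update that
        # predates the entry, since all older ones predate it as well.
        for update in reversed(self.updates):
            if update.applied_at <= entry.generated_at:
                break
            # Assumed staleness test: the update touches a query term.
            if query_terms & update.terms:
                del self.entries[query]
                return None
        return entry.results
```

In this sketch an invalidated hit is reported as a miss, so the caller would recompute the results at the backend and store them again with `put`. Keeping the updates in chronological order is what allows both shortcuts: the comparison against the latest update and the early exit from the backward scan.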
