Exploiting query term correlation for list caching in web search engines

Caching is widely used to improve the performance of Web search engines. Motivated by the correlation between terms in query logs from a commercial search engine, we explore a caching scheme based on pairs of terms rather than individual terms, which is the typical approach in search engines today. We propose an inverted list caching policy, based on the Least Recently Used (LRU) method, that accounts for the co-occurrence correlation between terms in the query stream when deciding which terms to keep in the cache. We consider term co-occurrence not only within the same query but also across separate queries. Experimental results show that, compared with existing list caching algorithms, the proposed approach can improve both the cache hit ratio and the overall throughput of the system.
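
To make the idea of pair-aware list caching concrete, the following is a minimal Python sketch, not the policy evaluated in the paper. It assumes a simple rule: whenever a term is accessed, any cached term that has co-occurred with it in at least `min_cooccurrence` earlier queries is also moved to the most recently used position. The class name `PairAwareLRUCache`, the `fetch_list` callback, the threshold, and the choice to measure capacity in number of lists are all hypothetical, and the sketch covers only co-occurrence within the same query.

```python
from collections import Counter, OrderedDict
from itertools import combinations


class PairAwareLRUCache:
    """LRU cache of inverted lists that also refreshes correlated partner terms.

    Illustrative assumption, not the paper's exact policy.
    """

    def __init__(self, capacity, min_cooccurrence=2):
        self.capacity = capacity                  # max number of cached lists (assumed unit)
        self.min_cooccurrence = min_cooccurrence  # hypothetical correlation threshold
        self.cache = OrderedDict()                # term -> inverted list, least recent first
        self.pair_counts = Counter()              # frozenset({a, b}) -> queries containing both
        self.hits = 0
        self.misses = 0

    def _touch_partners(self, term):
        # Move cached terms that co-occurred often enough with `term` to the MRU end.
        for other in list(self.cache):
            pair = frozenset((term, other))
            if other != term and self.pair_counts[pair] >= self.min_cooccurrence:
                self.cache.move_to_end(other)

    def process_query(self, terms, fetch_list):
        terms = list(dict.fromkeys(terms))        # drop duplicate terms, keep order
        for a, b in combinations(terms, 2):       # record within-query co-occurrence
            self.pair_counts[frozenset((a, b))] += 1
        for term in terms:
            if term in self.cache:
                self.hits += 1
            else:
                self.misses += 1
                self.cache[term] = fetch_list(term)  # e.g. read the list from disk
            self._touch_partners(term)
            self.cache.move_to_end(term)          # the accessed term is most recent
            while len(self.cache) > self.capacity:
                self.cache.popitem(last=False)    # evict the least recently used list
```

Under this rule, a term whose frequent partner was just queried survives an eviction that plain LRU would have applied to it, which is the intuition behind caching by term pairs rather than by individual terms.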
