Cache Design of SSD-Based Search Engine Architectures: An Experimental Study

Caching is an important optimization in search engine architectures. Existing caching techniques for search engines are mostly biased towards reducing random accesses to disk, because random accesses are known to be much more expensive than sequential accesses on traditional magnetic hard disk drives (HDDs). Recently, solid-state drives (SSDs) have emerged as a new kind of secondary storage medium, and some search engines, such as Baidu, have already used SSDs to completely replace HDDs in their infrastructure. One notable property of an SSD is that its random access latency is comparable to its sequential access latency. Therefore, replacing HDDs with SSDs in a search engine infrastructure may invalidate the assumptions behind the cache management of existing search engines. In this article, we carry out a series of empirical experiments to study the impact of SSDs on search engine cache management. Based on the results, we give insights to practitioners and researchers on how to adapt the infrastructure and caching policies for SSD-based search engines.
