SSD In-Storage Computing for Search Engines

SSD-based in-storage computing (called "Smart SSDs") allows application-specific code to execute inside SSDs to exploit their high internal bandwidth and energy-efficient processors. As a result, Smart SSDs have been successfully deployed in many industry settings, e.g., Samsung, IBM, Teradata, and Oracle. Moreover, researchers have demonstrated their potential in database systems, data mining, and big data processing. However, it remains unknown whether search engine systems can benefit from Smart SSDs. This work takes a first step toward answering this question. The major research issue is which search engine query processing operations can be cost-effectively offloaded to SSDs. To this end, we carefully identified the five most commonly used search engine operations that could potentially benefit from Smart SSDs: intersection, ranked intersection, ranked union, difference, and ranked difference. In close collaboration with Samsung, we offloaded these five operations of Apache Lucene (a widely used open-source search engine) to Samsung's Smart SSD. Finally, we conducted extensive experiments to evaluate system performance and tradeoffs using both synthetic and real datasets. The experimental results show that Smart SSDs reduce query latency by a factor of 2-3 and energy consumption by a factor of 6-10 for most of the aforementioned operations.
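
For readers unfamiliar with the operations named above, the following is a minimal sketch of what they compute on inverted-index posting lists. It assumes each posting list is a sorted list of integer document IDs and that ranking keeps the k highest-scoring documents; the function names, the `score` callback, and the sample data are hypothetical illustrations, not Lucene's API or the offloaded Smart SSD code.

```python
import heapq

def intersect(a, b):
    """Doc IDs present in both sorted posting lists (merge-style scan)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def difference(a, b):
    """Doc IDs in sorted list a that do not appear in sorted list b."""
    j = 0
    out = []
    for doc in a:
        # Advance b's cursor until it reaches or passes the current doc ID.
        while j < len(b) and b[j] < doc:
            j += 1
        if j == len(b) or b[j] != doc:
            out.append(doc)
    return out

def union(a, b):
    """Doc IDs present in either list, returned in sorted order."""
    return sorted(set(a) | set(b))

def top_k(doc_ids, score, k):
    """Ranked variant: keep the k docs with the highest score (assumed callback)."""
    return heapq.nlargest(k, doc_ids, key=score)

# Hypothetical posting lists and per-document scores.
term_a = [1, 3, 5, 7]
term_b = [3, 4, 5, 8]
scores = {1: 0.2, 3: 0.9, 4: 0.1, 5: 0.4, 7: 0.7, 8: 0.3}

both = intersect(term_a, term_b)                                   # [3, 5]
only_a = difference(term_a, term_b)                                # [1, 7]
ranked_union = top_k(union(term_a, term_b), scores.get, 3)         # [3, 7, 5]
ranked_diff = top_k(only_a, scores.get, 1)                         # [7]
```

Each ranked operation here is simply the unranked set operation followed by top-k selection, which keeps the sketch short; a production engine would typically interleave scoring with list traversal.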
