Two-dimensional indexing to provide one-integrated-memory view of distributed memory for a massively-parallel search engine

We propose two-dimensional indexing, a novel in-memory indexing architecture that operates over the distributed memory of a massively-parallel search engine. The goal of two-dimensional indexing is to provide a one-integrated-memory view, as in a single-node system with one large integrated memory. In two-dimensional indexing, we partition the entire index into n × m fragments and distribute them over the memories of multiple nodes in such a way that each fragment is stored entirely in the main memory of one node. The proposed architecture is scalable because it uses a scaled-out shared-nothing architecture, and it achieves low query response time because it processes queries in main memory. We also propose the concept of the one-memory point, which is the amount of memory space required to store the entire index completely in main memory, thereby providing a one-integrated-memory view. We first prove the effectiveness of two-dimensional indexing with single-keyword queries and then extend the notion to handle multiple-keyword queries. To handle multiple-keyword queries, we adopt pre-join, which materializes a multiple-keyword query a priori, as well as a new notion of semi-memory join, which avoids the extensive communication overhead of performing a join across multiple nodes. In experiments using a real-life search query set over a database of 100 million crawled Web documents, we show that two-dimensional indexing can effectively provide a one-integrated-memory view without much additional memory compared with a single-node system using one large integrated memory. We also show that, in an ideal case, a six-node prototype improves query processing performance by up to 535.54 times over a disk-based search engine that has an equivalent amount of in-memory buffer but does not use two-dimensional indexing. This improvement is expected to grow as the system is scaled out with more machines.
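
The partitioning idea can be illustrated with a minimal sketch. The abstract does not say what the two dimensions are, so the scheme below is an assumption made only for illustration: rows are chosen by a hash of the keyword and columns by the document ID modulo m. The function names (`fragment_of`, `partition_index`, `assign_to_nodes`, `search`) and the round-robin placement are hypothetical and are not taken from the paper.

```python
import zlib
from collections import defaultdict


def fragment_of(term, doc_id, n, m):
    """Map one posting to a fragment (row, col) of an n x m grid.

    Assumed scheme: row = hash of the keyword, col = document ID modulo m.
    """
    row = zlib.crc32(term.encode("utf-8")) % n  # deterministic across runs
    col = doc_id % m
    return row, col


def partition_index(index, n, m):
    """Split an inverted index {term: [doc_id, ...]} into n x m fragments."""
    fragments = defaultdict(lambda: defaultdict(list))
    for term, postings in index.items():
        for doc_id in postings:
            fragments[fragment_of(term, doc_id, n, m)][term].append(doc_id)
    return fragments


def assign_to_nodes(n, m, num_nodes):
    """Place every fragment wholly on exactly one node (round-robin, assumed)."""
    return {(r, c): (r * m + c) % num_nodes for r in range(n) for c in range(m)}


def search(term, fragments, n, m):
    """Single-keyword query: read the keyword's row across all m columns."""
    row = zlib.crc32(term.encode("utf-8")) % n
    result = []
    for col in range(m):
        result.extend(fragments.get((row, col), {}).get(term, []))
    return sorted(result)


index = {"apple": [1, 2, 3, 4], "banana": [2, 5]}
fragments = partition_index(index, n=2, m=3)
placement = assign_to_nodes(n=2, m=3, num_nodes=6)
print(search("apple", fragments, n=2, m=3))
```

Under this assumed scheme, a single-keyword query touches only the m fragments in one row, and each fragment sits in the memory of a single node, which matches the property the abstract states.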
