Paper information - Two-dimensional indexing to provide one-integrated-memory view of distributed memory for a massively-parallel search engine

We propose two-dimensional indexing, a novel in-memory indexing architecture that operates over the distributed memory of a massively-parallel search engine. The goal of two-dimensional indexing is to provide a one-integrated-memory view, as in a single-node system using one large integrated memory. In two-dimensional indexing, we partition the entire index into n × m fragments and distribute them over the memories of multiple nodes in such a way that each fragment is stored entirely in the main memory of one node. The proposed architecture is scalable because it uses a scaled-out shared-nothing architecture, and it achieves low query response time because it processes queries in main memory. We also propose the concept of the one-memory point, which is the amount of memory space required to store the entire index completely in main memory, thereby providing a one-integrated-memory view. We first prove the effectiveness of two-dimensional indexing with single-keyword queries and then extend the notion to handle multiple-keyword queries. To handle multiple-keyword queries, we adopt pre-join, which materializes a multiple-keyword query a priori, as well as a new notion of semi-memory join, which obviates the extensive communication overhead of performing joins across multiple nodes. In experiments using a real-life search query set over a database of 100 million crawled Web documents, we show that two-dimensional indexing can effectively provide a one-integrated-memory view without much additional memory compared with a single-node system using one large integrated memory. We also show that, with a six-node prototype, in an ideal case, it improves query processing performance by up to 535.54 times over a disk-based search engine that has an equivalent amount of in-memory buffer but does not use two-dimensional indexing. This improvement is expected to grow as the system is scaled out to a larger number of machines.
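The n × m partitioning described above can be illustrated with a minimal Python sketch. The abstract does not say what the two dimensions are, so this sketch assumes that rows partition documents and columns partition keywords; the hash-based placement, the round-robin node assignment, and all function names (`fragment_of`, `build_fragments`, `assign_to_nodes`, `search`) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from zlib import crc32


def fragment_of(term, doc_id, n, m):
    """Map one posting to a fragment (row, col) in an n x m grid.

    Assumption (not from the paper): rows split documents,
    columns split keywords.
    """
    row = doc_id % n
    col = crc32(term.encode("utf-8")) % m
    return row, col


def build_fragments(postings, n, m):
    """Group (term, doc_id) postings into at most n * m fragments."""
    fragments = defaultdict(lambda: defaultdict(list))
    for term, doc_id in postings:
        fragments[fragment_of(term, doc_id, n, m)][term].append(doc_id)
    return fragments


def assign_to_nodes(n, m, num_nodes):
    """Place every fragment on exactly one node (round-robin, illustrative)."""
    return {(r, c): (r * m + c) % num_nodes
            for r in range(n) for c in range(m)}


def search(term, fragments, n, m):
    """Single-keyword lookup: only one column of fragments can hold the term."""
    col = crc32(term.encode("utf-8")) % m
    hits = []
    for row in range(n):
        hits.extend(fragments.get((row, col), {}).get(term, []))
    return sorted(hits)


postings = [("memory", 1), ("memory", 4), ("index", 2), ("memory", 7)]
fragments = build_fragments(postings, n=3, m=2)
placement = assign_to_nodes(n=3, m=2, num_nodes=6)
result = search("memory", fragments, n=3, m=2)
```

Under these assumptions, a single-keyword query touches only the n fragments in one column, and each of those fragments lives wholly in the memory of a single node, which is the property the abstract relies on.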
