Paper information - ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality

Recently, parallel search engines have been implemented based on scalable distributed file systems such as Google File System. However, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative, since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system, allowing easier and less error-prone application development while still providing scalability. Regarding higher-level functionality, we can draw a parallel with the traditional contrast between an O/S file system and a DBMS. In this paper, we propose a new approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS. To estimate the performance, we propose a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We argue that the model can accurately estimate the performance of a massively-parallel (e.g., 300-node) search engine using the experimental results obtained from a small-scale (e.g., 5-node) one. We show that the estimation error between the model and the actual experiment is less than 2.13% by observing that the bulk of the query processing time is spent at the slave (vs. at the master and network) and by estimating the time spent at the slave based on actual measurement. Using our model, we demonstrate commercial-level scalability and performance of our architecture. Our proposed system ODYS is capable of handling 1 billion queries per day (81 queries/sec) for 30 billion Web pages by using only 43,472 nodes with an average query response time of 194 ms. By using twice as many (86,944) nodes, ODYS can provide an average query response time of 148 ms. These results show that building a massively-parallel search engine using a parallel DBMS is a viable approach, with the advantage of supporting a high-level (i.e., DBMS-level), SQL-like programming interface.
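The throughput figures in the abstract can be cross-checked with simple arithmetic. The sketch below is not taken from the paper. It assumes that 81 queries/sec is the rate sustained by one replicated set of nodes rather than by the whole system, a reading the abstract does not state explicitly. The names `per_set_rate`, `sets_needed`, and `nodes_per_set` are hypothetical and introduced only for illustration.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted in the abstract.
queries_per_day = 1_000_000_000
total_nodes = 43_472
doubled_nodes = 86_944

# ASSUMPTION (not stated in the abstract): 81 queries/sec is the
# throughput of a single replicated set of nodes, not of the whole system.
per_set_rate = 81

# Aggregate rate needed to serve 1 billion queries in one day.
aggregate_rate = queries_per_day / SECONDS_PER_DAY

# Number of sets needed if each set sustains per_set_rate queries/sec.
sets_needed = math.ceil(aggregate_rate / per_set_rate)

# How the quoted node count divides across those sets.
nodes_per_set, leftover_nodes = divmod(total_nodes, sets_needed)

print(f"aggregate rate : {aggregate_rate:,.0f} queries/sec")
print(f"sets needed    : {sets_needed}")
print(f"nodes per set  : {nodes_per_set} (leftover {leftover_nodes})")
print(f"doubled config : {doubled_nodes == 2 * total_nodes}")
```

Under this assumption the daily target works out to roughly 11,574 queries/sec in aggregate, which needs 143 sets, and the quoted 43,472 nodes divide evenly into 143 sets of 304 nodes. That even division makes the per-set reading plausible, but it remains an inference from the quoted numbers.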
