A Parallel Document Engine Built on Top of a Cluster of Databases - Design, Implementation, and Experiences -

We report on the implementation and evaluation of a document engine that efficiently supports many parallel search requests and concurrent insertion requests, and that scales as the number of such requests grows. We use a cluster of commodity database systems in a shared-nothing architecture. Building on previous results on multi-level transactions, we decompose each service request into short parallel database transactions. A coordinator, implemented as an extension of a transaction processing monitor, routes each short transaction to the appropriate database system in the cluster according to the data distribution that we have chosen. We have paid particular attention to the design and implementation of the coordinator to keep it from becoming a bottleneck. To that end, we implemented auxiliary functionality such as term extraction as services and distributed them over the cluster. Extensive experiments show the following: (1) A relatively small number of components already suffices to cope with high workloads. (2) The coordinator of the database cluster has minimal impact on CPU resource consumption and on response times. For example, the response time overhead of the coordinator is on the order of milliseconds, while the response time for retrievals and insertions remains within seconds even with 100 parallel search or insertion streams. This is somewhat unexpected, since the coordinator performs signature-based predicate locking and writes additional logging information. We conclude that a database cluster with a coordinator on top is a good, scalable infrastructure for complex application services.
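
The decomposition and routing described above can be illustrated with a minimal sketch. It is not the system's implementation: the abstract does not state which data distribution was chosen, so hash partitioning of index entries by term is an assumption made here, and all names (`node_for_term`, `decompose_insert`, `insert_document`, `search`) are hypothetical. In-memory dictionaries stand in for the database nodes, and locking, logging, and recovery are omitted.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def node_for_term(term, num_nodes):
    """Assumed placement rule: hash-partition index entries by term."""
    return zlib.crc32(term.encode("utf-8")) % num_nodes


def extract_terms(text):
    """Toy term extraction: lowercase words with surrounding punctuation removed."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return words - {""}


def decompose_insert(doc_id, text, num_nodes):
    """Split one insertion request into one short transaction per affected node."""
    per_node = defaultdict(list)
    for term in sorted(extract_terms(text)):
        per_node[node_for_term(term, num_nodes)].append((term, doc_id))
    return dict(per_node)


def run_short_transaction(node, postings, cluster):
    """Stand-in for a short database transaction executed on a single node."""
    index = cluster[node]
    for term, doc_id in postings:
        index.setdefault(term, set()).add(doc_id)
    return len(postings)


def insert_document(doc_id, text, cluster):
    """Coordinator role: decompose the request, then run the parts in parallel."""
    plan = decompose_insert(doc_id, text, len(cluster))
    # Each short transaction touches a different node, so the threads
    # never write to the same dictionary.
    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        futures = [
            pool.submit(run_short_transaction, node, postings, cluster)
            for node, postings in plan.items()
        ]
        return sum(f.result() for f in futures)


def search(term, cluster):
    """A single-term lookup is routed to the one node that owns the term."""
    term = term.lower()
    node = node_for_term(term, len(cluster))
    return set(cluster[node].get(term, set()))
```

Under this assumed placement, an insertion fans out to at most one short transaction per node, while a single-term search touches exactly one node. For example, with `cluster = [dict() for _ in range(4)]`, calling `insert_document(1, "A document engine.", cluster)` followed by `search("engine", cluster)` returns the set containing document 1.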
