PowerDB-IR: information retrieval on top of a database cluster

Our current concern is a scalable infrastructure for information retrieval (IR) with up-to-date retrieval results in the presence of frequent, continuous updates. Timely processing of updates is important in novel application domains such as e-commerce. We want to use off-the-shelf hardware and software as much as possible. Meeting these goals is challenging, given the additional requirement that the resulting system must scale well. We have built PowerDB-IR, a system with the characteristics sought. This paper describes its design, implementation, and evaluation. PowerDB-IR is a coordination layer for a database cluster. The rationale behind a database cluster is to 'scale out', i.e., to add further cluster nodes whenever better performance is needed. We build on IR-to-database mappings and service decomposition to support high-level parallelism. We follow a three-tier architecture with the database cluster as the bottom layer for storage management. The middle tier provides IR-specific processing and update services. PowerDB-IR has the following features: It allows documents to be inserted and retrieved concurrently, and it ensures freshness with almost no overhead. Alternative physical data organization schemes provide adequate performance for different workloads. Query processing techniques for the different data organizations efficiently integrate the ranked retrieval results from the cluster nodes. We have run extensive experiments with our prototype using commercial database systems and middleware products. The main result is that PowerDB-IR shows surprisingly good, close to ideal scalability and low response times.
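
To illustrate what integrating ranked retrieval results from the cluster nodes can involve, the following is a minimal sketch and not the PowerDB-IR implementation. It assumes a data organization in which each node holds a disjoint subset of the documents and returns its local hits already sorted by descending score, with scores that are comparable across nodes. The function name `merge_ranked`, the `(doc_id, score)` tuples, and the sample data are hypothetical.

```python
import heapq
from itertools import islice


def merge_ranked(node_results, k):
    """Merge per-node hit lists into one global top-k list.

    Assumption: every list in node_results is already sorted by
    descending score and scores are comparable across nodes.
    """
    # heapq.merge expects ascending order, so the key negates the score.
    merged = heapq.merge(*node_results, key=lambda hit: -hit[1])
    # The merge is lazy, so only the first k hits are ever produced.
    return list(islice(merged, k))


# Hypothetical (doc_id, score) hits returned by three cluster nodes.
node_a = [("d3", 0.91), ("d7", 0.40)]
node_b = [("d12", 0.75), ("d15", 0.10)]
node_c = [("d21", 0.88)]

top3 = merge_ranked([node_a, node_b, node_c], k=3)
```

Because the merge is lazy, the middle tier in such a scheme would need only the best k hits from each node to produce a global top-k list.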
