PowerDB-IR – Scalable Information Retrieval and Storage with a Cluster of Databases

Our objective is a scalable infrastructure for information retrieval (IR) that delivers up-to-date retrieval results in the presence of updates. Timely processing of updates is important in novel application domains such as e-commerce. Meeting this goal is challenging, given the additional requirement that the system must scale well. We have built PowerDB-IR, a system with these characteristics. This article describes its design, implementation, and evaluation. We follow a three-tier architecture with a database cluster as the bottom layer for storage management. The rationale for a database cluster is to ‘scale out’, i.e., to add further cluster nodes whenever better performance is needed. The middle tier provides IR-specific retrieval and update services. We deploy state-of-the-art middleware software to coordinate the cluster and to invoke IR-specific components. PowerDB-IR extends the middleware layer with service decomposition and parallelisation. PowerDB-IR has the following features: (1) It supports state-of-the-art retrieval models such as vector space retrieval. (2) It allows documents to be inserted and retrieved concurrently and ensures up-to-date retrieval results with almost no overhead. (3) It ensures correct global concurrency control and recovery. (4) Alternative physical data organisation schemes, each with its own query processing techniques, provide adequate performance for different workloads and database sizes. (5) Scaling out the database cluster yields higher throughput and lower response times. We have run extensive experiments with PowerDB-IR using several commercial database systems as well as different middleware products. Further experiments have quantified the effect of transactional guarantees on performance. The main result is that PowerDB-IR shows surprisingly good scalability and low response times.
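To make the idea of vector space retrieval over a scaled-out cluster more concrete, the following is a minimal sketch, not PowerDB-IR's actual implementation. It assumes that documents are partitioned across nodes, that each node keeps a local inverted index, and that a coordinator (standing in for the middle tier) merges per-node top-k lists. All function names, the in-memory dictionaries that stand in for database nodes, and the unnormalised tf-idf inner-product scoring are illustrative assumptions rather than details taken from the article.

```python
import heapq
import math
from collections import Counter


def build_partition(docs):
    """Build a local inverted index {term: {doc_id: tf}} for one (hypothetical) node."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index.setdefault(term, {})[doc_id] = tf
    return {"index": index, "n_docs": len(docs)}


def global_idf(partitions, terms):
    """Combine per-node document frequencies into cluster-wide idf weights."""
    n = sum(p["n_docs"] for p in partitions)
    idf = {}
    for term in terms:
        df = sum(len(p["index"].get(term, {})) for p in partitions)
        idf[term] = math.log(n / df) if df else 0.0
    return idf


def local_top_k(partition, idf, k):
    """Score one node's documents: document weight tf*idf times query weight idf."""
    scores = Counter()
    for term, weight in idf.items():
        for doc_id, tf in partition["index"].get(term, {}).items():
            scores[doc_id] += tf * weight * weight
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


def search(partitions, query, k=3):
    """Coordinator: gather each node's local top-k and merge into a global top-k."""
    terms = set(query.lower().split())
    idf = global_idf(partitions, terms)
    candidates = [hit for p in partitions for hit in local_top_k(p, idf, k)]
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])


if __name__ == "__main__":
    node_a = build_partition({"d1": "database cluster retrieval", "d2": "cluster middleware"})
    node_b = build_partition({"d3": "retrieval retrieval updates", "d4": "e-commerce updates"})
    print(search([node_a, node_b], "retrieval", k=2))
```

In this sketch each node only needs to return its own k best documents, so adding a node adds scoring capacity without changing the merge step. A real deployment would issue the per-node work as database queries in parallel and would have to keep the global statistics consistent under concurrent updates, which this sketch does not attempt.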
